1. 13 Sep, 2024 1 commit
    • Customize filesystem in CK for legacy systems (#1509) · 81bc1496
      Jun Liu authored
      
      
      * Legacy support: customized filesystem (see the shim sketch after this entry)
      
      * Update cmakefile for python alternative path
      
      * fix build issues
      
      * CK has no boost dependency
      
      * More fixes to issues found on legacy systems
      
      * fix clang format issue
      
      * Check if blob is correctly generated in cmake
      
      * fix the python issues
      
      * add a compiler flag for codegen when using alternative python
      
      * use target_link_options instead of target_compile_options
      
      ---------
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      81bc1496
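      A minimal sketch of the customized-filesystem idea described above, assuming a
      hypothetical CK_USE_STD_FILESYSTEM switch and a ck::fs shim namespace (the real
      names and scope of #1509 may differ):

      ```cpp
      // Prefer std::filesystem where the toolchain provides it; on legacy
      // systems fall back to a tiny hand-rolled shim, keeping CK free of any
      // Boost dependency. CK_USE_STD_FILESYSTEM and ck::fs are hypothetical.
      #if defined(CK_USE_STD_FILESYSTEM)
      #include <filesystem>
      namespace ck {
      namespace fs = std::filesystem;
      } // namespace ck
      #else
      #include <string>
      namespace ck {
      namespace fs {
      // Stand-in that offers only what the library actually uses.
      class path
      {
          std::string str_;

          public:
          path() = default;
          path(std::string s) : str_(std::move(s)) {}

          path operator/(const path& rhs) const { return path(str_ + "/" + rhs.str_); }
          const std::string& string() const { return str_; }
      };
      } // namespace fs
      } // namespace ck
      #endif
      ```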
  2. 12 Sep, 2024 2 commits
  3. 11 Sep, 2024 2 commits
  4. 10 Sep, 2024 1 commit
  5. 09 Sep, 2024 1 commit
  6. 07 Sep, 2024 1 commit
    • Ck tile gemm example (#1488) · caacd388
      Thomas Ning authored
      
      
      * Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
      
      * Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
      
      * Fix: Clang Format, API fixed from fmha
      
      * fix with better naming convention
      
      * revert the pipeline code of fmha
      
      * Fixed: Addressed the comments and merged the GEMM shape of the GEMM Operator and FMHA Operator into one.
      
      * clang format with the reference_gemm file (see the reference GEMM sketch after this entry)
      
      * convert the clang format with the remod.py
      
      * Changed the format and variable name of the kernel gemm_shape and partitioner
      
      ---------
      Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
      caacd388
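      The reference_gemm file mentioned above is the host-side oracle the example is
      verified against; a naive sketch of what such a reference typically looks like
      (row-major layout and the template shape are assumptions, not the exact ck_tile
      API):

      ```cpp
      #include <cstddef>
      #include <vector>

      // Naive row-major reference GEMM, C = A * B, used only for verification.
      template <typename ADataType, typename BDataType, typename CDataType, typename AccDataType>
      void reference_gemm(const std::vector<ADataType>& a, // M x K
                          const std::vector<BDataType>& b, // K x N
                          std::vector<CDataType>& c,       // M x N
                          std::size_t M, std::size_t N, std::size_t K)
      {
          for(std::size_t m = 0; m < M; ++m)
              for(std::size_t n = 0; n < N; ++n)
              {
                  AccDataType acc = 0;
                  for(std::size_t k = 0; k < K; ++k)
                      acc += static_cast<AccDataType>(a[m * K + k]) *
                             static_cast<AccDataType>(b[k * N + n]);
                  c[m * N + n] = static_cast<CDataType>(acc);
              }
      }
      ```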
  7. 05 Sep, 2024 2 commits
  8. 04 Sep, 2024 3 commits
  9. 03 Sep, 2024 1 commit
  10. 02 Sep, 2024 1 commit
  11. 30 Aug, 2024 2 commits
  12. 29 Aug, 2024 1 commit
  13. 28 Aug, 2024 1 commit
    • [CK_TILE] Add PagedAttention kernels (#1387) · c1569892
      Po Yen Chen authored
      
      
      * Use dictionary to config all the functions
      
      * Add init codegen logic for fmha fwd appendkv
      
      * Call HIP_CHECK_ERROR() macro to get real source info
      
      * Set up meaningful arguments
      
      * Sync kernel name with the codegen
      
      * Add knew/vnew tensors to the kernel argument
      
      * Fix wrong K values after appending
      
      * Fix vnew append error
      
      * Extract common logic
      
      * Fix Vnew tile dstr for row major case
      
      * Conditionally add fwd_splitkv API in fmha_fwd example
      
      * Conditionally add call to fmha_fwd_splitkv()
      
      * Remove "EXAMPLE_" prefix of cmake variables
      
      * Register API handlers automatically
      
      * Early return if 0 < s_k_new is not supported
      
      * Show message if we are ignoring option
      
      * Unify CMakeLists.txt coding style
      
      * Set num_splits=1 if split-kv is not supported
      
      * Add length/stride getters for HostTensor
      
      * Add RoPE example utilities
      
      * Add reference_rotary_position_embedding() (not implemented)
      
      * Finish reference_rotary_position_embedding() impl (see the RoPE sketch after this entry)
      
      * Fix typo of HostTensor<>::get_length()
      
      * Fix compilation errors
      
      * Fix wrong answer when interleaved=false
      
      * Fix wrong answer when interleaved=true
      
      * Append K/V in the host verification code
      
      * Simplify K appending logic
      
      * Simplify v_host_ref definition
      
      * Reduce input/output dimensions
      
      * Rename function: add "batched" prefix
      
      * Apply RoPE on host side
      
      * Rename RoPE utility function
      
      * Fix wrong tensor size
      
      * Avoid invoking deprecated method 'find_module'
      
      * Pass RoPE kernel args
      
      * Create Rotary Cos/Sin tile windows in kernel
      
      * Add compute data type alias for RoPE
      
      * Randomly generate seqlen_knew if needed
      
      * Fix seqlen_knew enabling check logic
      
      * Add minimum seqlen_k to generate compliant kvcache
      
      * Fix compilation error in debug mode
      
      * Fix wrong boundaries
      
      * Fix wrong seqlen_k for kvcache
      
      * Rename variables used in distribution encoding
      
      * Fix rotary cos/sin tensor/tile size
      
      * Add constraint to the rotary_dim option
      
      * Remove unused inner namespace
      
      * Add dram distribution for rotary_cos/rotary_sin (interleaved)
      
      * Only apply interleaved RoPE on Knew for now
      
      * Fix wrong thread starting offset
      
      * Instantiate multiple kernels for RoPE approaches
      
      * Clean-up pipeline
      
      * Fix error in RoPE host reference
      
      * Handle RoPE half-rotated logic
      
      * Support 8x rotary_dim under half-rotated RoPE
      
      * Add comment
      
      * Apply elementwise function to the loaded tiles
      
      * Unify parameter/variable naming style
      
      * Remove constness from q_ptr
      
      * Add code blocks for q_tile
      
      * Apply RoPE to q_tile
      
      * Remove debug print code in kernel
      
      * Fix wrong knew/vnew appending positions
      
      * Use better naming for tile indices
      
      * Add make_tile_window() for adding distribution only
      
      * Skip code if # of blocks is more than needed
      
      * Move thread locating logic into policy
      
      * Remove always true static_assert()
      
      * Rename header
      
      * Rename RotaryEmbeddingEnum
      
      * Extract rotary embedding logic out
      
      * Re-order parameters
      
      * Align naming of some tile size constants
      
      * Rename more tile size constants
      
      * Fix wrong grid size
      
      * Fix wrong shape of knew_host/vnew_host
      
      * Fix wrong index into knew_host/vnew_host
      
      * Fix wrong rotary_cos/rotary_sin memory size for Q
      
      * Extract Q/Knew vector size to helper methods
      
      * Use different rotary_cos/rotary_sin distr for Q/Knew
      
      * Update host/device specifiers
      
      * Fix wrong data type for Q rotary_cos/rotary_sin
      
      * Remove RoPEComputeDataType type alias
      
      * Shift rotary_cos/rotary_sin by cache_seqlen_k
      
      * Add comment for why I just use 't' for all padding flags
      
      * Align commit message to the real comment
      
      * Fix wrong pipeline
      
      * Rename utility function
      
      * Disable host verification if the API does not exist
      
      * Fix wrong rope key for fp8 pipeline
      
      * Allow only applying RoPE on Q (without appending KV)
      
      * Add append-kv smoke tests
      
      * Remove debug statements
      
      * Remove more debug statements
      
      * Re-arrange the 'set +x' command
      
      * Remove no-longer used method in pipeline
      
      * Add missing init code
      
      * Refine pipeline padding settings
      
      * Enlarge rotary_dim limit (8 -> 16)
      
      * Enlarge KPerThread for rotary_interleaved=false
      
      * Update rotary_dim range in smoke_test_fwd.sh
      
      * Add template argument 'kIsPagedKV' for splitkv kernels
      
      * Launch splitkv kernel if given page_block_size
      
      * Fix wrong kernel name
      
      * Fix seqlen_k_min for pre-fill case (1 -> 0)
      
      * Add copy_const<> type trait
      
      * Add another make_tile_window()
      
      * Introduce 'TileWindowNavigator' types (see the paged-KV sketch after this entry)
      
      * Simplify TileWindowNavigator interfaces
      
      * Fix tile window navigation bugs
      
      * Disable calling fmha_fwd()
      
      * Remove unnecessary data members
      
      * Simplify more make_tile_window() overloads
      
      * Move V tile through TileWindowNavigator
      
      * Fix uneven split checking logic
      
      * Move code after deciding seqlen_q/seqlen_k
      
      * Make sure we always start reading a complete tile
      
      * Use 128 as minimum page_block_size
      
      * Fix wrong origin for bias
      
      * Add batch_stride_k/batch_stride_v in group mode
      
      * Unify origin
      
      * Add missing kernel arguments for group mode
      
      * Add paged-kv codegen logic for appendkv kernels
      
      * Add block_table kernel args for appendkv kernel
      
      * Add tile navigators to the appendkv kernel
      
      * Fix wrong tensor descriptor lengths
      
      * Pass re-created tile window to pipeline
      
      * Fix wrong strides for appendkv kernel
      
      * Allow transiting tile_window to another page-block
      
      * Handle cross-page-block write
      
      * Do not perform write again if already in last page-block
      
      * Always add fmha_fwd() api
      
      * Add missing group mode argument
      
      * Remove debug macro usages
      
      * Rename option s_k_new to s_knew
      
      * Separate splitkv/non-splitkv args/traits
      
      * Remove fmha_fwd_dispatch()
      
      * Fix compilation errors
      
      * Remove dropout code in splitkv kernel
      
      * Allow problem types without defining kHasDropout attr
      
      * Use generic lambda to init traits objects
      
      * Separate more non-splitkv & splitkv traits/args
      
      * Display more info for specific kernels
      
      * Show more detailed warning message
      
      * Rename 'max_num_blocks' to 'max_num_page_blocks'
      
      * Remove no-longer used pipeline files
      
      * Wrap code by #if directives
      
      * Move functors to the beginning of validation code
      
      * Use generic lambda to init all the api traits/args
      
      * Fix wrong seqlen for kvcache
      
      * Add missing comment
      
      * Rename TileWindowNavigator to PageBlockNavigator
      
      * Only expose necessary methods (not attributes)
      
      * Re-order pipeline parameters
      
      * Refine smoke_test_fwd.sh
      
      * Fix wrong argument count
      
      * Make tile window directly via PageBlockNavigator
      
      * Remove unused template parameter
      
      * Remove group mode from appendkv kernel
      
      * Fix skcheck logic
      
      * Fix wrong syntax in skcheck expr
      
      * Use meaningful options in smoke test
      
      * Remove options
      
      * Fix formatting
      
      * Fix more format
      
      * Re-organize bash functions
      
      * Pass cache_batch_idx to kernels
      
      * Support cache_batch_idx in example
      
      * Fix compilation error
      
      * Add more appendkv test
      
      * Add more case for appendkv
      
      * Fix nonexistent attribute
      
      * Remove 0 < seqlen_knew constraint
      
      * Clarify the case in warning message
      
      * Remove macro checking
      
      * Force batch mode when invoking appendkv & splitkv apis
      
      * Fix mode overriding logic
      
      * Fix wrong parameter name
      
      * Randomize seqlen_k if using kvcache
      
      * Use randomized seqlen_k for kvcache
      
      * Avoid using too small rotary_cos & rotary_sin
      
      * Rename parameter
      
      * Add seqlen_q & seqlen_k rules
      
      * Add comment
      
      * Add more comments
      
      * Fix compilation errors
      
      * Fix typo in comment
      
      * Remove type argument
      
      * Avoid seqlen_k=0 for kvcache
      
      * Revert "Avoid seqlen_k=0 for kvcache"
      
      This reverts commit 21c4df89e416182e8e9bc78e67bd4b98dbb6c88d.
      
      * Fix wrong uneven split checking logic
      
      * Only randomize kvcache seqlen_k if 1 < batch
      
      * Return earlier if split is empty
      
      * Revert "Only randomize kvcache seqlen_k if 1 < batch"
      
      This reverts commit b9a4ab0d7e3c2beecc0fccafd2a13259dd06299c.
      
      * Re-order seqlen_k_start adjustment logic
      
      * Fix compilation errors
      
      * Re-format script
      
      * Find executable from folder automatically
      
      * Fix kvcache seqlen_k generating logic
      
      * Make comment more clear
      
      * Fix wrong knew/vnew appending logic on host
      
      * Add s_barrier to sync threads
      
      * Revert "Add s_barrier to sync threads"
      
      This reverts commit d3f550f30c0a4d9df15c613015d5dff268d6746d.
      
      * Support only using 1 row of rotary_cos/rotary_sin
      
      * Rotate Q in different way
      
      * Unify tensor view creation logic
      
      * Fix wrong argument
      
      * Add mask to switch how we use the rotary_cos/sin
      
      * Move attr from traits to problem
      
      * Move has_mask to fmha_fwd_appendkv_args
      
      * Support using uint32_t as SAD operand in Alibi<>
      
      * Use sad_u32() in splitkv kernels
      
      * Store tensor views in PageBlockNavigator
      
      * Use stored tensor view to update tile windows
      
      * Enlarge tensor view size
      
      * Remove debug code
      
      * Fix wrong tensor view size
      
      * Wrap tensor view into PageBlockNavigator
      
      * Add DataType member to PageBlockNavigator
      
      * Remove unnecessary member functions
      
      * Refine macro use
      
      * Fix typo
      
      * Add blank line between directives and actual code
      
      * Re-format files
      
      * Remove type in comment
      
      ---------
      Co-authored-by: carlushuang <carlus.huang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      c1569892
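      The paged-KV work above revolves around translating a logical K/V cache
      position into a physical page-block plus an in-block offset through a
      per-batch block table; a schematic sketch of that lookup (the struct and
      method names are illustrative, the real PageBlockNavigator interface differs):

      ```cpp
      #include <cstdint>

      // Schematic paged-KV lookup: logical position -> (physical block, offset).
      struct PagedKvView
      {
          const int32_t* block_table; // physical block index per logical block
          int32_t page_block_size;    // e.g. 128, the minimum used above

          struct Coord { int32_t physical_block; int32_t offset; };

          Coord locate(int32_t logical_pos) const
          {
              return {block_table[logical_pos / page_block_size],
                      logical_pos % page_block_size};
          }
      };
      ```

      Likewise, the reference_rotary_position_embedding() thread distinguishes the
      interleaved and half-rotated RoPE layouts; a simplified host-side sketch of
      the two variants (cos/sin rows are assumed precomputed per position):

      ```cpp
      #include <vector>

      // Apply rotary position embedding to one row x[0..rotary_dim).
      void rotary_embed_row(std::vector<float>& x, const std::vector<float>& cos_row,
                            const std::vector<float>& sin_row, int rotary_dim,
                            bool interleaved)
      {
          const int h = rotary_dim / 2;
          for(int i = 0; i < h; ++i)
          {
              // interleaved pairs (x[2i], x[2i+1]); half-rotated pairs (x[i], x[i+h])
              const int i0 = interleaved ? 2 * i : i;
              const int i1 = interleaved ? 2 * i + 1 : i + h;
              const float x0 = x[i0], x1 = x[i1];
              x[i0] = x0 * cos_row[i] - x1 * sin_row[i];
              x[i1] = x0 * sin_row[i] + x1 * cos_row[i];
          }
      }
      ```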
  14. 26 Aug, 2024 1 commit
    • Enable daily ninja build traces. (#1487) · 19d22e60
      Illia Silin authored
      * add ninja trace to CI builds
      
      * fix ninja trace logic
      
      * update the ninja trace logic in jenkins file
      
      * limit the number of threads to run ninja build
      
      * use ninja for installation after build
      
      * update the path to ninjatracing tool
      
      * use ninja to run check when using build trace
      
      * fix jenkins logic
      
      * fix typos
      
      * set proper setup_args for all stages
      
      * fix ninja syntax
      
      * replace ninja check with ninja test
      
      * enable ninja tracing with mainline and staging compilers
      19d22e60
  15. 23 Aug, 2024 1 commit
  16. 22 Aug, 2024 5 commits
  17. 21 Aug, 2024 2 commits
    • Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473) · c3515f27
      Andriy Roshchenko authored
      * Enable CMakePresets build
      
      * Verify Convolution, Scaling and ReLU algorithms.
      
      * Add tensor element-wise scale and type cast operation.
      
      * Reduction implemented but does not work.
      
      * Exploration of Reduction functionality.
      
      * Completed example for Convolution scaled with ReLU activation and AMAX reduction (see the AMAX sketch after this entry).
      
      * WIP: Add required instances for convolution.
      
      * WIP: Create client example. Implement convolution stage.
      
      * Add elementwise instances.
      
      * Add elementwise scale + convert example.
      
      * Add reduction instances.
      
      * WIP: Client example for AMAX reduction.
      
      * WIP: Add instances for multistage reduction.
      
      * WIP: Implementation of multistage reduction.
      
      * Refactoring.
      
      * Clean up.
      
      * Add CMakePresets.json
      
      * Guard off FP8 instances when the data type is not available.
      
      * Add example for Scaled FP8 Convolution with AMAX reduction.
      
      * Refactor CombConvScaleRelu instances.
      
      * Add CombConvScale instances.
      
      * Add client example for Scaled FP8 Convolution with AMAX reduction.
      
      * Cleanup.
      c3515f27
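      AMAX here is the maximum absolute value over a tensor, which yields the scale
      factor that keeps FP8 outputs within range; a minimal sketch of the idea (the
      448.0f e4m3 maximum and the shape of the API are assumptions about intent,
      not the instance interface):

      ```cpp
      #include <algorithm>
      #include <cmath>
      #include <vector>

      // AMAX reduction: max |x| over a tensor, then an FP8 scale derived from it.
      float amax(const std::vector<float>& t)
      {
          float m = 0.0f;
          for(float v : t)
              m = std::max(m, std::fabs(v));
          return m;
      }

      int main()
      {
          std::vector<float> conv_out = {0.5f, -3.25f, 2.0f};
          const float fp8_e4m3_max = 448.0f; // largest finite e4m3 value
          const float scale        = fp8_e4m3_max / amax(conv_out);
          for(float& v : conv_out)
              v *= scale; // scaled values now fit the FP8 range before casting
      }
      ```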
    • Set RNE fp8 conversion as a default (#1458) · e20f20ef
      Rostyslav Geyyer authored
      * Set RNE fp8 conversion as a default (see the rounding sketch after this entry)
      
      * Update f8 tests
      
      * Disable failing test on gfx11
      
      * Update bf8 tests
      
      * Add a flag
      
      * Fix the flag
      
      * Raise flag for gfx10 as well
      
      * Temp commit for tolerance testing
      
      * Update tolerances
      e20f20ef
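      RNE is round-to-nearest-even: a value exactly halfway between two
      representable numbers rounds toward the one whose last kept bit is even. A
      small sketch of just the tie-breaking rule applied when dropping mantissa
      bits (a real fp8 conversion also handles exponent bias and saturation,
      omitted here):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Drop `drop_bits` low bits with round-to-nearest-even tie breaking.
      uint32_t rne_shift(uint32_t mant, int drop_bits)
      {
          const uint32_t half = 1u << (drop_bits - 1);
          const uint32_t rem  = mant & ((1u << drop_bits) - 1);
          mant >>= drop_bits;
          if(rem > half || (rem == half && (mant & 1u)))
              ++mant; // above half, or a tie with odd kept bits: round up
          return mant;
      }

      int main()
      {
          assert(rne_shift(0b1001, 2) == 0b10);  // below half: round down
          assert(rne_shift(0b1011, 2) == 0b11);  // above half: round up
          assert(rne_shift(0b1010, 2) == 0b10);  // tie, kept bits even: stay
          assert(rne_shift(0b1110, 2) == 0b100); // tie, kept bits odd: round up
      }
      ```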
  18. 20 Aug, 2024 3 commits
  19. 19 Aug, 2024 1 commit
  20. 16 Aug, 2024 3 commits
    • Re-enable fp8 types for all architectures. (#1470) · c8b6b642
      Illia Silin authored
      * re-enable fp8 and bf8 for all targets
      
      * restore the fp8 gemm instances
      
      * re-enable conv_3d fp8 on all architectures
      
      * disable several fp8 gemm instances on all architectures except gfx94
      
      * clang format fix
      c8b6b642
    • [CK_TILE] FA bwd kernels optimization (#1397) · 79a5d9c1
      Dan Yao authored
      
      
      * tmp save
      
      * fix batch deterministic bugs
      
      * fix group deterministic bugs
      
      * codegen update
      
      * reorder files
      
      * bias support
      
      * hd256 bias support
      
      * bwd smoke test update
      
      * simplify convert dq
      
      * fix hd256 dropout scratch
      
      * do{}while() -> while(){}
      
      * comments
      
      * remove FmhaBwdTilePartitioner
      
      * save clear_tile
      
      * refactor dropout
      
      * code cleanup
      
      * code cleanup
      
      * comments
      
      * fix epilogue problem
      
      * fix fwd dropout
      
      * group convert_dq opt
      
      * fix dq alignment
      
      * Do not store storerandval in bwd for flash attention integration
      
      * fix hd32 error and boost performance
      
      * revert
      
      * Remove duplicated WarpGemm definitions in the policy file
      
      * dropout patch for mrepeat 16*16
      
      * code sync up
      
      * dq_acc stride
      
      * dq_acc stride stuff
      
      * codegen update
      
      * fwd dropout revert
      
      * fix hd128 scratches and boost performance
      
      * receipt 3 for simplified smoke test
      
      * more strides for fa integration
      
      * fix hd64 scratches and boost performance
      
      * non-iglp pipeline for headdim padding cases
      
      * dpad same as dvpad for flash attention integration
      
      * unpadded lse&d for group mode
      
      * Support unpad layout for group lse
      
      * Support unpad lse layout for splitkv
      
      * Fix stride for splitkv kernel
      
      * fix unpadded lse issue in fwd splitkv
      
      * comment
      
      * solve lds read&write conflicts
      
      * rename
      
      * bias rename
      
      * tile index revert
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
      79a5d9c1
    • Add performance and large tensor tests for grouped conv (#1456) · 2581727d
      Bartłomiej Kocot authored
      
      
      * Add performance and large tensor tests for grouped conv
      
      * Resize tests
      
      * Resize tests
      
      * update the python script to parse the grouped_conv results
      
      * Remove int8 tests
      
      * change bwd wei layout
      
      ---------
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      2581727d
  21. 15 Aug, 2024 2 commits
  22. 14 Aug, 2024 1 commit
    • [GEMM] gemm_universal related optimization (#1453) · 3049b546
      Haocong WANG authored
      
      
      * replace buffer_atomic with global_atomic
      
      * fixed global_atomic_add
      
      * added bf16 atomic_add
      
      * format
      
      * clang-format-12
      
      * clean
      
      * clean
      
      * add guards
      
      * Update gtest.cmake
      
      * enabled splitk_gemm_multi_d (see the split-K sketch after this entry)
      
      * format
      
      * add ckProfiler
      
      * format
      
      * fixed naming
      
      * format
      
      * clean
      
      * clean
      
      * add guards
      
      * fix clang format
      
      * format
      
      * add kbatch printout
      
      * clean
      
      * Add rocm6.2 related gemm optimization
      
      * Limit bf16 atomic usage
      
      * remove redundant RCR gemm_universal instance
      
      * Add RRR fp8 gemm universal instance
      
      * Bug fix
      
      * Add GPU_TARGET guard to FP8/BF8 target
      
      * bug fix
      
      * update cmake
      
      * remove all fp8/bf8 examples if the arch does not support them
      
      * Enable fp8 RRR support in ckProfiler
      
      * limit greedy-reverse flag to gemm_universal in ckProfiler
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@fb.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
      3049b546
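      The atomic-add work above supports split-K GEMM: the K dimension is divided
      into kbatch chunks, each computed by its own workgroup, and the partial
      products meet in C through atomic adds (the global_atomic path mentioned in
      the commits). A serial host-side sketch of the decomposition, with the
      device atomicAdd stood in by a plain +=:

      ```cpp
      #include <algorithm>
      #include <cstddef>
      #include <vector>

      // Split-K GEMM decomposition; c must be zero-initialized by the caller.
      void splitk_gemm(const std::vector<float>& a, const std::vector<float>& b,
                       std::vector<float>& c, std::size_t M, std::size_t N,
                       std::size_t K, std::size_t kbatch)
      {
          const std::size_t k_per_batch = (K + kbatch - 1) / kbatch;
          for(std::size_t kb = 0; kb < kbatch; ++kb) // one chunk per workgroup
          {
              const std::size_t k0 = kb * k_per_batch;
              const std::size_t k1 = std::min(K, k0 + k_per_batch);
              for(std::size_t m = 0; m < M; ++m)
                  for(std::size_t n = 0; n < N; ++n)
                  {
                      float partial = 0.0f;
                      for(std::size_t k = k0; k < k1; ++k)
                          partial += a[m * K + k] * b[k * N + n];
                      c[m * N + n] += partial; // atomicAdd on the device
                  }
          }
      }
      ```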
  23. 13 Aug, 2024 2 commits
    • Fix compilation errors with libc++ (#1461) · 50c42348
      AngryLoki authored
      
      
      This fixes two issues when compiling with libc++.
      
      The first issue is an attempt to call std::numeric_limits<ranges::range_value_t<_Float16>>::min().
      _Float16 is an extension of libstdc++; it does not exist in the C++ standard[2].
      Luckily, there is a NumericLimits class in composable_kernel, which does everything needed.
      
      The second issue is that the call to 'check_err' is ambiguous: there are two candidates.
      This happens because composable_kernel relies on the idea that f8_t (defined as _BitInt(8)) does not satisfy the is_integral trait.
      However, libc++ treats _BitInt(N) as integral (per the standard, "any implementation-defined extended integer types" may be integral); see the sketch after this entry.
      
      Closes: #1460
      Signed-off-by: Sv. Lockal <lockalsash@gmail.com>
      50c42348
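      A distilled, clang-only sketch of the trait divergence behind the ambiguity
      (_BitInt in C++ is a Clang extension; the check_err signatures below are
      illustrative, not the real ones):

      ```cpp
      #include <type_traits>

      using f8_t = _BitInt(8); // how composable_kernel defines its fp8 storage type

      // CK selects between overloads like these, assuming f8_t fails is_integral.
      template <typename T, std::enable_if_t<std::is_integral_v<T>, bool> = true>
      bool check_err(const T&) { return true; } // integer comparison path

      template <typename T, std::enable_if_t<!std::is_integral_v<T>, bool> = true>
      bool check_err(const T&) { return false; } // floating/f8 tolerance path

      int main()
      {
          // libstdc++: is_integral_v<f8_t> is false, so the second overload wins.
          // libc++:    the trait is true, so overload selection flips; with CK's
          //            original constraints both candidates survived, making the
          //            call ambiguous.
          return check_err(f8_t{0}) ? 0 : 1;
      }
      ```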
    • Mateusz Ozga