1. 05 Jul, 2024 4 commits
  2. 04 Jul, 2024 1 commit
  3. 03 Jul, 2024 2 commits
  4. 27 Jun, 2024 2 commits
  5. 26 Jun, 2024 2 commits
      [CK_TILE] fmha forward split-kv + combine kernels (#1338) · 0cb2e06d
      Po Yen Chen authored
      
      
      * FA fwd dropout
      
      * FA bwd
      
      * epilogue reuse
      
      * CMakeLists update
      
      * [CK_TILE] support alibi (#1269)
      
      * add alibi support
      
      * fix code
      
      * update code based on comment
      
      * Support more hdim
      
      * fix fp8 bias
      
      * support seqlen_k=0 case
      
      * remove unused printf
      
      * fix format
      
      ---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
      
      * now fwd/bwd can build
      
      * bwd alibi
      
      * add bwd validation stream_config
      
      * update generated filenames
      
      * update bwd kernel launch
      
      * CK_TILE_HOST_DEVICE in philox
      
      * Transpose -> transpose
      
      * format
      
      * format
      
      * format
      
      * Generate the instance for FA required
      
      * format
      
      * fix error in WarpGemm
      
      * Add num_splits option and dummy split-kv api method
      
      * Generate fmha_fwd_splitkv()
      
      * Add SplitKV kernel codegen logics
      
      * Add SplitKV combine kernel codegen logics
      
      * Fix mismatched return type
      
      * Clean-up code
      
      * Replace sentinel value before storing
      
      * Fix wrong layout of LSE/LSEacc/Oacc
      
      * Format codes
      
      * Fix o_acc memory error
      
      * Fix wrong kBlockSize used in policy
      
      * Reduce # of combine kernels
      
      * Fix split-kv combine kernel name
      
      * Fix wrong LDS indexing logics
      
      * Fix wrong loop counter step logic
      
      * Undo vector size changes
      
      * Remove no-longer used field
      
* Remove inconsistent comment
      
      * Remove debug statements in example
      
      * Remove more debug statements
      
      * Add constness to local variables
      
* Clean up generate.py
      
      * Fix unstable clang-format comment
      
      * Remove unused include directive
      
      * Use shorter template parameter name
      
      * Enable non-split-kv blobs
      
      * Update license date
      
      * Print num_splits conditionally
      
      * Undo disabling data types
      
* Remove unnecessary tile size for fp8
      
      * Fix wrong pipeline args for fp8
      
      * Fix example output format
      
      * Remove more debug code in combine pipeline
      
      * Add stride kernel arguments for LSE/O acc workspace
      
      * Re-order split-kv pipeline call operator arguments
      
      * Pass LSE/O strides in kernel argument
      
      * Re-order pipeline call operator arguments
      
      * Use tensor_descriptor to locate LSEacc elements
      
      * Support providing invalid element for tensor view
      
      * Set invalid element value for LSEacc tensor view
      
      * Remove hand-written store_tile() code
      
      * Remove necessary value-overwrite logic
      
      * Add transposed lds descriptor
      
      * Support load_tile() for tile_window_with_static_lengths<>
      
      * Undo removing necessary value-overwrite logic
      
      * Use read descriptor to locate lds elements
      
      * Simplify pipeline source code
      
      * Add constraint to kMaxSplits
      
      * Default use kMaxSplits=64 in generate.py
      
      * Revert "Add constraint to kMaxSplits"
      
      This reverts commit 0a2132d758042e6fb0292f4e354909b8a4d1c118.
      
      * Revert "Default use kMaxSplits=64 in generate.py"
      
      This reverts commit c7d9c80b77320aec6559222bed7d47adcaefe4e3.
      
      * Decide alignment by the padding parameter
      
      * Remove no-longer used utility functions
      
      * Remove not-working code
      
      * Add comment & remove no-longer used code
      
      * Fix computation errors
      
      * Add heuristic to override num_splits option
      
      * Add constraint to kMaxSplits
      
      * Fix compilation error
      
      * Clean up pipeline code
      
      * Wrap pointer access as lambda function
      
      * Rename confusing methods
      
* Use kLogMaxSplits as template parameter
      
      * Finish splitkv combine kernel codegen
      
      * Update kMaxSplits limit
      
      * Use smaller kM0 for splitkv combine kernel
      
* Ignore dropout flag in splitkv pipeline
      
      * Unify flag usage
      
      * Add back flag kStoreLSE
      
      * Merge lambda calls in pipeline
      
      * Fix compilation errors
      
      * Avoid all empty splits
      
      * Always check for empty loop in splitkv pipelines
      
      * Re-order parameters
      
      * Remove redundant p_drop option check
      
      * Add traits/problem for fwd splitkv kernel
      
      * Conditionally enable uneven split boundary checks
      
      * Add comment for the splitkv traits field
      
      * Change even split criteria
      
      * Re-order statements
      
      * Refine occupancy value for hdim=128&256
      
      * Refine occupancy value for hdim=32&64
      
      * Remove redundant kernel argument
      
      * Separate fmha bwd codegen logics
      
      * Separate fmha fwd codegen logics
      
      * Remove redundant direction parameter in fwd&bwd codegen logics
      
* Support generating multiple APIs for an example
      
* Make 'api' an alias of the 'direction' option
      
      * Remove choices for the 'direction' option
      
* Use a dictionary to configure all the functions
      
      * Move fmha splitkv codegen logics to other file
      
      * Add fwd_splitkv api for tile_example_fmha_fwd
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
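The split-kv + combine scheme this commit series implements (each split attends to one chunk of K/V and writes a partial output plus its log-sum-exp to workspace, then a combine kernel merges them) can be sketched in NumPy. This is a minimal model of the idea for a single query vector, not the CK_TILE kernels; all function and variable names are illustrative:

```python
import numpy as np

def attention_ref(q, k, v):
    """Single-pass softmax attention for one query vector (reference)."""
    s = k @ q                                  # scores, shape (seqlen_k,)
    p = np.exp(s - s.max())
    return (p / p.sum()) @ v

def fmha_fwd_splitkv(q, k, v, num_splits):
    """Phase 1: each split attends to one K/V chunk, producing a partial
    output (the O accumulator) and its log-sum-exp (the LSE accumulator)."""
    o_acc, lse_acc = [], []
    for idx in np.array_split(np.arange(len(k)), num_splits):
        s = k[idx] @ q
        m = s.max()                            # chunk-local max for stability
        p = np.exp(s - m)
        l = p.sum()
        o_acc.append((p @ v[idx]) / l)         # chunk-normalized partial output
        lse_acc.append(m + np.log(l))          # chunk log-sum-exp
    return np.stack(o_acc), np.array(lse_acc)

def fmha_fwd_splitkv_combine(o_acc, lse_acc):
    """Phase 2: weight each partial output by its share of the global
    softmax mass, exp(lse_i - lse_global), and sum."""
    m = lse_acc.max()
    w = np.exp(lse_acc - m)
    w /= w.sum()                               # equals exp(lse_i - lse_global)
    return w @ o_acc
```

Because the combine weight for split i is exactly exp(lse_i − lse), the splits can be computed independently and merged without approximation, which is what makes the two-kernel decomposition exact.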
    • Harisankar Sadasivan · 66e0e909
  6. 25 Jun, 2024 4 commits
  7. 24 Jun, 2024 2 commits
  8. 22 Jun, 2024 1 commit
  9. 20 Jun, 2024 1 commit
  10. 18 Jun, 2024 1 commit
  11. 12 Jun, 2024 1 commit
  12. 10 Jun, 2024 1 commit
  13. 05 Jun, 2024 1 commit
      Add a scale op, related instances and examples (#1242) · cb0645be
      Rostyslav Geyyer authored
      
      
      * Add a scale op
      
      * Update the element op
      
      * Add instances
      
      * Add an example
      
      * Add a client example
      
      * Add a flag check
      
      * Revert flag check addition
      
      * Fix flag check
      
      * Update d strides in example
      
      * Update d strides in client example
      
      * Apply suggestions from code review
      
      Update copyright header
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Move the example
      
      * Move the client example
      
      * Update element op
      
      * Update example with the new element op
      
      * Add scalar layout
      
      * Update example
      
      * Update kernel for scalar Ds
      
      * Revert kernel changes
      
      * Update element op
      
      * Update example to use scales' pointers
      
      * Format
      
      * Update instances
      
      * Update client example
      
      * Move element op to unary elements
      
      * Update element op to work with values instead of pointers
      
      * Update instances to take element op as an argument
      
      * Update examples to use random scale values
      
      ---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
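The scale op this commit adds ends up taking its scale by value and being passed into the instance as an epilogue element op. The pattern can be sketched as below; this is only the shape of the interface, not CK's actual functor signature:

```python
import numpy as np

def scale_op(scale):
    """Elementwise epilogue op: multiply each output value by a scale
    captured by value (illustrative stand-in for the CK element op)."""
    def op(c):
        return scale * c
    return op

def gemm_with_element_op(a, b, element_op):
    """Run a GEMM, then apply the element op to the fp32 accumulator."""
    acc = a.astype(np.float32) @ b.astype(np.float32)
    return element_op(acc)
```

Passing the value rather than a pointer (as the later commits in the series switch to) keeps the op self-contained, so instances can take it as a plain argument.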
  14. 04 Jun, 2024 1 commit
      CK Tile FA Training kernels (#1286) · 2cab8d39
      Dan Yao authored
      
      
      * FA fwd dropout
      
      * FA bwd
      
      * epilogue reuse
      
      * CMakeLists update
      
      * [CK_TILE] support alibi (#1269)
      
      * add alibi support
      
      * fix code
      
      * update code based on comment
      
      * Support more hdim
      
      * fix fp8 bias
      
      * support seqlen_k=0 case
      
      * remove unused printf
      
      * fix format
      
      ---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
      
      * now fwd/bwd can build
      
      * bwd alibi
      
      * add bwd validation stream_config
      
      * update generated filenames
      
      * update bwd kernel launch
      
      * CK_TILE_HOST_DEVICE in philox
      
      * Transpose -> transpose
      
      * format
      
      * format
      
      * format
      
      * Generate the instance for FA required
      
      * format
      
      * fix error in WarpGemm
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
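The alibi support folded into these training kernels adds a per-head linear bias to the attention scores. One common formulation (the standard ALiBi recipe for power-of-two head counts; FA-style kernels compute the bias on the fly per score tile rather than materializing it) looks like this:

```python
import numpy as np

def alibi_slopes(num_heads):
    """Per-head geometric slopes 2^(-8/H), 2^(-16/H), ..., for a
    power-of-two head count H."""
    return np.array([2.0 ** (-8.0 * (h + 1) / num_heads)
                     for h in range(num_heads)])

def alibi_bias(slope, seqlen_q, seqlen_k):
    """Linear attention bias -slope * |i - j|, added to the score
    matrix before softmax (one common variant)."""
    i = np.arange(seqlen_q)[:, None]
    j = np.arange(seqlen_k)[None, :]
    return -slope * np.abs(i - j)
```

Since the bias depends only on the (i, j) position and a scalar slope, supporting it inside a tiled kernel costs no extra global memory traffic.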
  15. 01 Jun, 2024 1 commit
      Post-merge fix of PR 1300 (#1313) · 6fb1f4e0
      zjing14 authored
      * add f8 gemm with multiD for both row/col wise
      
      * change compute_type to fp8
      
      * changed tuning parameters in the example
      
      * add rcr example
      
      * post-merge fix
      
      * fix
      
      * reduce init range
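The f8 gemm with row/col-wise scales that PR 1300 introduced (and this commit fixes up) follows a standard pattern: quantize A with a per-row scale and B with a per-column scale, run the GEMM on the quantized operands, and apply both scales in the epilogue as extra "D" inputs. A hedged NumPy sketch, with no actual fp8 rounding (so the round trip is exact) and illustrative names:

```python
import numpy as np

def quant_rowwise(a, fp8_max=448.0):
    """Scale each row of A into an e4m3-style range; returns the
    scaled tensor and its per-row scale (illustration only)."""
    scale = np.abs(a).max(axis=1, keepdims=True) / fp8_max
    return a / scale, scale

def quant_colwise(b, fp8_max=448.0):
    """Per-column counterpart for the B operand."""
    scale = np.abs(b).max(axis=0, keepdims=True) / fp8_max
    return b / scale, scale

def f8_gemm_multid(aq, bq, a_scale, b_scale):
    """GEMM on quantized operands; the row and column scales come back
    in through the multi-D epilogue: E = (Aq @ Bq) * a_scale * b_scale."""
    return (aq @ bq) * a_scale * b_scale
```

Broadcasting does the bookkeeping: `a_scale` has shape (M, 1) and `b_scale` shape (1, N), so each output element picks up exactly its row and column scale.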
  16. 28 May, 2024 2 commits
      add f8 gemm multiD with both row/col wise scale (#1300) · 80db62f0
      zjing14 authored
      * add f8 gemm with multiD for both row/col wise
      
      * change compute_type to fp8
      
      * changed tuning parameters in the example
      
      * add rcr example
      [CK_TILE] support group from cmdline (#1295) · 5055b3bd
      carlushuang authored
      * support cmdline seqlen decode
      
      * silent print
      
      * update readme
      
      * update kernel launch 3d
      
      * update tile partitioner
      
      * fix spill for bf16
      
      * modify based on comment
      
      * modify payload_t
      
      * fix bug for alibi mode
      
      * fix alibi test err
      
      * refactor kernel launch, support select timer
      
      * add missing file
      
      * remove useless code
      
      * add some comments
  17. 22 May, 2024 1 commit
  18. 11 May, 2024 1 commit
  19. 10 May, 2024 2 commits
  20. 09 May, 2024 1 commit
  21. 07 May, 2024 1 commit
  22. 30 Apr, 2024 1 commit
  23. 26 Apr, 2024 2 commits
      [GEMM] UniversalGemm update (#1262) · 764164b4
      Haocong WANG authored
      
      
      * Add bf16 instances
      
      * Add bf16 gemm universal example
      
      * tempsave
      
      * Add guard to navi compilation
      
* workaround for a specific mixed gemm instance (bring it back once the compiler fix is uploaded)
      
* fix formatting issue in condition statement
      
      * solve conflict
      
      ---------
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
      bf16A_Int8B with fastgelu/bias (#1264) · 0d0150db
      zjing14 authored
      * changed the copy function to v7r2
      
      * adding multi_abd
      
      * in-progress
      
      * add post-load oob check
      
      * debugging
      
      * adjust instances
      
      * add run_lds
      
* add elementwise_op
      
      * replace multi_abd_device with v3
      
      * clean up
      
      * clean
      
      * clean
      
      * Added LDSType
      
      * profiling
      
      * adjust oobcheck
      
      * add missing file
      
      * refactor
      
      * clean
      
      * add examples
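The bf16A/int8B kernel above is a weight-only-quantization GEMM with a fused bias + FastGelu epilogue. A minimal sketch of the data flow, assuming a per-column dequantization scale for B and the common tanh-approximated GELU (names are illustrative, not the CK operator names):

```python
import numpy as np

def fast_gelu(x):
    """tanh-approximated GELU, one common 'FastGelu' formulation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def bf16a_i8b_gemm(a, b_i8, b_scale, bias):
    """Dequantize the int8 B operand with a per-column scale, GEMM in
    fp32, then apply the bias + FastGelu epilogue in one pass."""
    b = b_i8.astype(np.float32) * b_scale[None, :]
    return fast_gelu(a.astype(np.float32) @ b + bias[None, :])
```

Fusing the dequantization and activation into the epilogue is what the multi-ABD machinery referenced in these bullets enables: the int8 weights stay packed in memory and are expanded only in registers.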
  24. 25 Apr, 2024 1 commit
      Grouped GEMM Multiple D tile loop. (#1247) · b4032629
      Adam Osewski authored
* Overload output stream operator for LoopScheduler and PipelineVersion
      
      * Add Run overload accepting grid descriptors MK.
      
      * Add __device__ keyword for CalculateGridSize
      
      * Create device op GroupedGemmMultipleD
      
      * Add GroupedGemm MultipleD Tile Loop implementation.
      
      * Add an example for GroupedGemm MultipleD tile loop.
      
      * Device Op GroupedGEMMTileLoop.
      
* Bunch of small changes in example.
      
      * CkProfiler
      
      * Remove unused tparam.
      
      * Fix include statement.
      
      * Fix output stream overloads.
      
* Do not make descriptors and check validity until we find the group.
      
      * Fix gemm desc initialization.
      
      * Revert device op
      
      * Fix compilation for DTYPES=FP16
      
* Validate tensor transfer parameters.
      
      * Validate on host only NK dims if M is not known.
      
      * Fix bug.
      
      * A convenient debug func for selecting threads.
      
      * Fix has main k block loop bug.
      
      * Make sure that b2c has up to date tile offset.
      
      * Output stream operator for Sequence type.
      
      * Cmake file formatting.
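The "tile loop" in this grouped GEMM flattens the output tiles of every group into one index space; each persistent workgroup strides over flat tile ids and, as the commit above puts it, builds descriptors only once it finds the owning group. A small sketch of that scheduling logic (pure host-side illustration; names are made up):

```python
def group_tile_counts(problems, tile_m=128, tile_n=128):
    """Number of output tiles each (M, N) problem contributes."""
    return [((m + tile_m - 1) // tile_m) * ((n + tile_n - 1) // tile_n)
            for m, n in problems]

def find_group(flat_tile_id, counts):
    """Walk the per-group tile counts until the flat id falls inside a
    group; only then would a kernel build that group's descriptors and
    validate them. Returns (group index, tile index within group)."""
    for g, c in enumerate(counts):
        if flat_tile_id < c:
            return g, flat_tile_id
        flat_tile_id -= c
    raise IndexError("tile id out of range")
```

A persistent workgroup `wg` out of `num_wgs` would then loop `for tid in range(wg, sum(counts), num_wgs)`, calling `find_group(tid, counts)` for each tile it processes.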
  25. 19 Apr, 2024 1 commit
      Refactor elementwise kernels (#1222) · ad1597c4
      Bartłomiej Kocot authored
      * Refactor elementwise kernels
      
      * Instances fixes
      
      * Fix cmake
      
      * Fix max pool bwd test
      
      * Update two stage gemm split k
      
* Restore elementwise scale for hiptensor backward compatibility
      
      * Fix Acc data type check in conv fwd multiple abd
      
      * Disable conv fp64 fwd example
      
      * Update grouped conv weight multi d
  26. 18 Apr, 2024 1 commit
  27. 16 Apr, 2024 1 commit
      Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978) · 12865fbf
      zjing14 authored
      
      
      * added an example grouped_gemm_multi_abd
      
      * fixed ci
      
      * add setElementwiseOp
      
      * changed API
      
      * clean code: add multiA into example
      
      * fixed v7r2 copy
      
      * add transpose
      
      * clean
      
      * fixed vector_load check
      
      * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * add reduce
      
      * testing
      
      * add example_b16_i8
      
      * refactor example
      
      * clean
      
* add MPadding
      
      * disable reduce for kbatch = 1
      
* separate reduce device op
      
      * add reduce op
      
      * add guard for workspace_size
      
      * add instances
      
      * format
      
      * fixed
      
      * add client example
      
      * add a colmajor
      
      * add instances
      
      * Update cmake-ck-dev.sh
      
      * Update profile_gemm_splitk.cpp
      
      * Update gridwise_gemm_xdlops_v2r4r2.hpp
      
      * format
      
      * Update profile_gemm_splitk.cpp
      
      * fixed
      
      * fixed
      
      * adjust test
      
      * adjust precision loss
      
      * adjust test
      
      * fixed
      
      * add bf16_i8 scale bias
      
      * fixed scale
      
      * fixed scale elementwise_op
      
      * revert contraction deviceop changes
      
      * fixed
      
      * Add AddFastGelu
      
      * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"
      
      This reverts commit 3b5d001efd74335b38dcb7d8c8877580b49d23a4, reversing
      changes made to 943199a99191661c5597c51ca8371a90bf57837e.
      
      * add Scales into elementwise
      
      * add gemm_multi_abd client example
      
      * add client examples
      
      * add rcr and crr
      
      * add grouped gemm client example
      
      * add grouped gemm client example
      
      * add instance for rcr crr
      
      * format
      
      * fixed
      
      * fixed cmake
      
      * fixed
      
      * fixed client_example
      
      * format
      
      * fixed contraction isSupport
      
      * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
      
      * Update device_reduce_threadwise.hpp
      
      * clean
      
      * Fixes
      
      * Fix example
      
      ---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
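The Multi_ABD interface this PR adds generalizes the GEMM to take several A, B and D tensors, each combined by a user-supplied element op: E = cde_op(a_op(A0, A1, ...) @ b_op(B0, ...), D0, ...). A hedged sketch of that call shape, with made-up op names matching the bf16/int8 scale-bias use case the bullets mention:

```python
import numpy as np

def gemm_multi_abd(as_, bs, ds, a_op, b_op, cde_op):
    """Fuse elementwise combination of several A, B and D tensors
    around one GEMM (interface shape only, not the CK device op)."""
    return cde_op(a_op(*as_) @ b_op(*bs), *ds)

def b_scale(b_i8, scale):
    """Example B op: dequantize int8 weights with a per-column scale."""
    return b_i8.astype(np.float32) * scale[None, :]

def add_bias(c, bias):
    """Example CDE op: add a bias row (the single D tensor)."""
    return c + bias[None, :]
```

With `a_op` as the identity, this reproduces the bf16A/int8B scale-bias GEMM from the client examples; swapping `add_bias` for a gelu-then-bias op covers the fused-activation variants.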