1. 09 Nov, 2024 1 commit
    • Ck tile/moe sorting (#1624) · bec6fbc6
      dummycoderfe authored
      
      
      * add moe_sorting & check ok
      
      * fix comments & typo
      
      * Run remod.py under include/ck_tile & example/ck_tile directories
      
      * format codes
      
      * fix output ci check bug
      
      * fix moe sorting readme and error commit file
      
      * use magic div to accelerate compute
      
      * add a loop unroll for moe lds ops
      
      * add extblocksnel to set zeros for moebufs
      
      * [Ck_tile] moe set zero run ok, add size check and fix ref check
      
      * [Ck_tile]fix moe_sorting fuse set_zero remod
      
      * [Ck_tile] change name style, fix zero buffer size err, change folder
      
      * [Ck_tile] moe_sorting: fix name style
      
      * [Ck_tile] moe_sorting, remove useless params in traits
      
      * [Ck_tile] change outputtile cnt * unit_size; change output buf alloc
      
      ---------
      Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
      Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
  2. 08 Nov, 2024 1 commit
  3. 07 Nov, 2024 1 commit
  4. 02 Nov, 2024 1 commit
  5. 01 Nov, 2024 2 commits
    • [Ck_tile] smoothquant (#1617) · fbd65454
      rocking authored
      
      
      * fix compile error
      
      * fix typo of padding
      
      * Add smoothquant op
      
      * Add smoothquant instance library
      
      * refine type
      
      * add test script
      
      * Re-generate smoothquant.hpp
      
      * Always use 'current year' in copyright
      
      * use Generic2dBlockShape instead
      
      * Add vector = 8 instance back
      
      * Find exe path automatically
      
      * Simplify the api condition
      
      * Remove debugging code
      
      * update year
      
      * Add blank line between function declaration
      
      * explicitly cast return value to dim3
      
      * refine return value
      
      * Fix default warmup and repeat value
      
      * Add comment
      
      * refactor smoothquant cmake
      
      * Add README
      
      * Fix typo
      
      ---------
      Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
    • [layernorm] hot fix (#1620) · 550248de
      carlushuang authored
      * hot fix ln
      
      * some rename
  6. 31 Oct, 2024 1 commit
    • [CK_TILE] layernorm support fused-quant/fused-add (#1604) · c3a4800c
      carlushuang authored
      * add prenorm/postnorm support, refactor using generate.py
      
      * update README
      
      * update README
      
      * fix format
      
      * update some description and fix format
      
      * update format
      
      * format
      
      * use non-raw for loading
      
      * format and update n4096
      
      * dynamic-quant ready
      
      * update readme
      
      * support fused dynamic-quant
      
      * update fused-quant, with smooth
      
      * update README
      
      * update args
      
      * update some based on comment
  7. 30 Oct, 2024 5 commits
    • Remove virtual destructors from unary ops (#1610) · 9a8a5213
      Bartłomiej Kocot authored
      * Remove virtual destructors from unary ops
      
      * Fixes
      
      * Fixes
      
      * clang format fixes
    • clang-format (#1612) · 7d911154
      rocking authored
    • [CK-Tile] Universal gemm memory bound pipeline (#1558) · 24d996aa
      Adam Osewski authored
      * CK-Tile GEMM with memory bound pipeline.
      
      * Memory bound gemm pipeline.
      
      * Fix not closed namespace.
      
      * Block gemm mem pipeline draft.
      
      * Do not use ck_tile:: within ck_tile namespace.
      
      * Refactoring & Move Layout info to pipeline problem.
      
      * Get hot loop and TailNum information before launching kernel.
      
      * Fixes in pipeline.
      
      * Add comment to load_tile_raw and change variable naming style.
      
      * Few small changes & formatting.
      
      * Do not use macro.
      
      * Add gtests.
      
      * Use AccDataType for Output of MFMA instruction.
      
      * Formatting.
      
      * Refactor gemm examples.
      
      * Switch over to current block gemm.
      
      * Use currently available pipeline policy.
      
      * Refactoring and review comments.
      
      * Fixes after merge.
      
      * Add missing include.
      
      * Add load tile overload which accepts output tensor as parameter.
      
      * This gives an 8% perf boost at the cost of using more registers.
      
      * Rename example.
      
      * Small changes.
      
      * Fix compilation err and lower K.
      
      * Support different layouts for A/B
      
      * Fix vector size for different layouts.
      
      * Rename Alignment into VectorSize
      
      * Unblock tests.
    • [Ck tile] support rmsnorm and related fusion (#1605) · 3d609534
      rocking authored
      * Add reduce2d new api
      
      * Prevent users from using cross-warp reduction
      
      * Fix bug of std calculation
      
      * Add rmsnorm2d
      
      * Add rmsnorm small example
      
      * Remove static assert to prevent compile fail
      
      * Add script to test performance and correctness
      
      * Add missing cmake change
      
      * refine naming
      
      * refine example of rmsnorm
      
      * Fix bug of rmsnorm
      
      * Refine naming
      
      * Fix cmake
      
      * clang format
      
      * Refine pipeline name
      
      * Add add_rmsnorm2d_rdquant kernel
      
      * Add reduce op
      
      * host verification
      
      * Fix bug of one pass pipeline
      
      * Refine tile size
      
      * Add two pass pipeline
      
      * Rename two pass to three pass
      
      * Fix bug of kSaveX == false
      
      * Add instance library
      
      * Add test script
      
      * Fix bug of x verification
      
      * Add save_x to trait
      
      * Add README
      
      * Move reduce2d into reduce folder
      
      * Fix bug of welford when number of m warp > 1
      
      * remove redundant comment
      
      * 1. move 06_rmsnorm2d to 10_rmsnorm2d
      2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant
      
      * clang format and add missing header
      
      * Add host validation of add + layernorm2d + rsquant
      
      * Revert "Add host validation of add + layernorm2d + rsquant"
      
      This reverts commit 936cb457978b928b90eff89a08fcdb7dc8bbed67.
      
      * Remove deprecated flag
    • [CK_TILE] Add fmha fwd headdim96 support (#1608) · 86322218
      Qianfeng authored
      
      
      * Add ceil_to_qualified_tile_length()
      
      * Rename kK0BlockLength to kQKHeaddim
      
      * Add kSubQKHeaddim concept to support headdim96
      
      * Fix in math.hpp to avoid using __half interfaces
      
      * Add LdsBufferSequence instance for headdim96
      
      * Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing
      
      * Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time
      
      * Reformat one file
      
      * Fix text alignment in fmha_fwd_splitkv.py
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
  8. 29 Oct, 2024 2 commits
  9. 26 Oct, 2024 2 commits
    • topk_softmax (#1592) · b098b71b
      carlushuang authored
      * topk_softmax
      
      * remove some file
      
      * fix atomic linear_offset
      
      * address various comment, and change sfc get_index api to static(tuple)
    • [CK_TILE] More fmha splitkv optimizations (#1588) · 54f0e6f4
      Po Yen Chen authored
      * Use pre-defined constants for readability
      
      * Use vector write for o_acc tensor
      
      * Remove no-longer used policy method
      
      * Deprecate no-longer used policy/pipeline
      
      * Specify gemm0/gemm1 block warps separately in codegen
      
      * Fix wrong ps_idx creation logic
      
      * Add single-warp block gemm
      
      * Support single-warp gemm0
      
      * Make MakeCBlockTile() a static method
      
      * Use MakeCBlockTile() to get underlying tile distribution
      
      * Use kNumGemm1Warps to compute # threads for gemm1
      
      * Put normal case in the if clause
      
      * Refine fmha splitkv block mapping
      
      * Refine & fix the lse_acc/o_acc layout
      
      * Fix wrong LDS size for K tile
      
      * Use kK0=64 for hdim=128,256 fmha splitkv kernels
      
      * Use kK1=64 for hdim=32,64,128 fmha splitkv kernels
      
      * Undo kK0/kK1 changes
      
      * Use more reasonable GetAlignmentV() computation
      
      * Using store_tile() in fmha splitkv kernel epilogue
  10. 25 Oct, 2024 1 commit
  11. 22 Oct, 2024 1 commit
    • update layernorm (#1570) · 0394f8a7
      ltqin authored
      * port layernorm
      
      * change warp_welford.hpp
      
      * Update warpshuffle
      
      * 1. Add save mean and save std back
      2. Move construction of tensor_view and tile_window to operator()
      
      * refine welford max count calculation
      
      * unify layernorm api
      
      * Rename file
      
      * Remove save mean and inv std
      
      * Revert "refine welford max count calculation"
      
      This reverts commit 02236580.
      
      * Fix order of parameter
      
      * refine welford max count calculation again
      
      * Remove fp32 instances
      
      * Fix bug of padding
      
      * refactor api
      
      * Support bf16
      
      * Extract common function
      
      * Refine arg of operator()
      
      * Add kMThreadPerBlock to template parameter
      
      * clang format
      
      * Refine variable name
      
      * Refine file name
      
      * remove redundant line
      
      * refactor layernorm2d pipeline and add block-per-block utility
      
      * fix name
      
      * rename more
      
      * add more block-per-tile instance
      
      * remove duplicated define
      
      * update instance for 2048, 1024 case
      
      * support up to 2048 now
      
      * opt loading
      
      * add n1536
      
      * Add two pass pipeline
      
      * format
      
      * Fix incorrect type
      
      * parallel compilation
      
      * Use smaller N
      
      * fix 2p pass
      
      * Support Repeat_M in distribution
      
      * Refine naming
      
      * Add reduce example
      
      ---------
      Co-authored-by: letaoqin <letaoqin@amd.com>
      Co-authored-by: aska-0096 <haocwang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
  12. 21 Oct, 2024 1 commit
    • [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577) · 95e722a3
      Po Yen Chen authored
      * Use smaller width for lse_accum dist tensor
      
      * Update pipeline comment
      
      * Fix wrong distribution for lse_accum
      
      * Remove duplicate dim in lse_accum dist encoding
      
      * Decide fmha splitkv combine kernel kBlockSize by kM0
      
      * Remove assumption of MPerThread=1
      
      * Add log<4> & log<8> specialization
      
      * Enlarge occupancy array
      
      * Fix vector size for small tile
      
      * Add support for kMaxSplits=8
      
      * Re-format gemm.hpp
      
      * Use 16x16x16 warp gemm for fwd_splitkv
      
      * Centralize policy code changes
      
      * Leave fp8/bf8 tile settings unchanged
  13. 16 Oct, 2024 1 commit
    • [CK_TILE] Improve headdim96 performance for fmha-bwd (#1573) · 14c3cfb1
      Qianfeng authored
      
      
      * Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96
      
      * Remove the use of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor
      
      * Fix in bwd_pipeline_default_policy
      
      * Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines
      
      * Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim
      
      * Update to hd96 tile settings
      
      * Add smoke test scripts for fmha-bwd hd96
      
      * Revert "Add smoke test scripts for fmha-bwd hd96"
      
      This reverts commit 7ca7e1a93dc65eb99ce3ff4e82693589830e42a2.
      
      * Remove hd96 tile settings in fmha_bwd codegen to save compile time
      
      * Fix lost code line in bwd_pipeline_default_policy
      
      * Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits
      
      * Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor
      
      * tiny adjustments
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: danyao12 <Dan.Yao@amd.com>
  14. 15 Oct, 2024 2 commits
  15. 14 Oct, 2024 1 commit
  16. 10 Oct, 2024 1 commit
    • Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535) · 6f27bc98
      Thomas Ning authored
      
      
      * Make the cshuffle compilable
      
      * modify the reference on gpu and cpu. Correct access of cshuffle
      
      * fix the cpu reference code
      
      * Complete the in tile shuffle logic
      
      * restructure the kernel template input
      
      * change the naming pattern of ck_tile gemm pipeline
      
      * Re-format files using remod.py
      
      * Solve the fmha conflict with gemm
      
      * Comment Addressed from Carlus
      
      ---------
      Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
  17. 08 Oct, 2024 2 commits
  18. 07 Oct, 2024 2 commits
  19. 04 Oct, 2024 1 commit
    • Adding seed and offset pointer support to the philox random number generator. (#1523) · c24fae23
      kylasa authored
      
      
      * Adding seed and offset pointer support to the philox random number generator.
      
      * Separating seed and offset pointer checks with different condition statements.
      
      * Add support for device seed and offset pointers; a union stores either the seed/offset values or the device pointers, to minimize device SGPR usage.
      
      * Correcting a typo in the readme file
      
      * Re-format files using remod.py
      
      * Use STL type for API parameters
      
      * Use simpler struct design for drop_seed & drop_offset
      
      * Undo unnecessary changes
      
      * Sync kargs style for fmha_fwd.hpp/.cpp
      
      * Use templated union to reduce code
      
      * Use structured binding to make code more readable
      
      ---------
      Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
  20. 01 Oct, 2024 2 commits
  21. 27 Sep, 2024 1 commit
  22. 26 Sep, 2024 1 commit
  23. 25 Sep, 2024 1 commit
  24. 22 Sep, 2024 1 commit
  25. 18 Sep, 2024 1 commit
  26. 14 Sep, 2024 1 commit
  27. 10 Sep, 2024 1 commit
  28. 07 Sep, 2024 1 commit
    • Ck tile gemm example (#1488) · caacd388
      Thomas Ning authored
      
      
      * Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
      
      * Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
      
      * Fix: Clang Format, API fixed from fmha
      
      * fix with better naming convention
      
      * revert back the pipeline code of fmha
      
      * Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.
      
      * clang format with the reference_gemm file
      
      * convert the clang format with the remod.py
      
      * Changed the format and variable name of the kernel gemm_shape and partitioner
      
      ---------
      Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
  29. 30 Aug, 2024 1 commit