  1. 23 Oct, 2024 1 commit
  2. 22 Oct, 2024 3 commits
    • Enable grouped conv bwd wei bf16 NGCHW (#1589) · 82fc5383
      Bartłomiej Kocot authored
      * Enable grouped conv bwd wei bf16 NGCHW
      
      * fixes
      
      * fixes
      
      * Fixes
      
      * fixes
      
      * fixes
      
      * Fixes
    • update layernorm (#1570) · 0394f8a7
      ltqin authored
      * port layernorm
      
      * change warp_welford.hpp
      
      * Update warpshuffle
      
      * 1. Add save mean and save std back
      2. Move construction of tensor_view and tile_window to operator()
      
      * refine welford max count calculation
      
      * unify layernorm api
      
      * Rename file
      
      * Remove save mean and inv std
      
      * Revert "refine welford max count calculation"
      
      This reverts commit 02236580.
      
      * Fix order of parameter
      
      * refine welford max count calculation again
      
      * Remove fp32 instances
      
      * Fix bug of padding
      
      * refactor api
      
      * Support bf16
      
      * Extract common function
      
      * Refine arg of operator()
      
      * Add kMThreadPerBlock to template parameter
      
      * clang format
      
      * Refine variable name
      
      * Refine file name
      
      * remove redundant line
      
      * refactor layernorm2d pipeline and add block-per-block utility
      
      * fix name
      
      * rename more
      
      * add more block-per-tile instance
      
      * remove duplicated define
      
      * update instance for 2048, 1024 case
      
      * support up to 2048 now
      
      * opt loading
      
      * add n1536
      
      * Add two pass pipeline
      
      * format
      
      * Fix incorrect type
      
      * parallel compilation
      
      * Use smaller N
      
      * fix 2p pass
      
      * Support Repeat_M in distribution
      
      * Refine naming
      
      * Add reduce example
      
      ---------
      Co-authored-by: letaoqin <letaoqin@amd.com>
      Co-authored-by: aska-0096 <haocwang@amd.com>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: carlushuang <carlus.huang@amd.com>
  3. 21 Oct, 2024 5 commits
  4. 18 Oct, 2024 2 commits
  5. 16 Oct, 2024 1 commit
    • [CK_TILE] Improve headdim96 performance for fmha-bwd (#1573) · 14c3cfb1
      Qianfeng authored
      
      
      * Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96
      
      * Remove the use of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor
      
      * Fix in bwd_pipeline_default_policy
      
      * Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines
      
      * Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim
      
      * Update to hd96 tile settings
      
      * Add smoke test scripts for fmha-bwd hd96
      
      * Revert "Add smoke test scripts for fmha-bwd hd96"
      
      This reverts commit 7ca7e1a93dc65eb99ce3ff4e82693589830e42a2.
      
      * Remove hd96 tile settings in fmha_bwd codegen to save compile time
      
      * Fix lost code line in bwd_pipeline_default_policy
      
      * Merge kDoDvHeaddim/kPadHeadDimDoDv into kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits
      
      * Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor
      
      * tiny adjustments
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
      Co-authored-by: danyao12 <Dan.Yao@amd.com>
  6. 15 Oct, 2024 3 commits
  7. 14 Oct, 2024 3 commits
  8. 12 Oct, 2024 1 commit
  9. 11 Oct, 2024 1 commit
  10. 10 Oct, 2024 4 commits
  11. 09 Oct, 2024 3 commits
  12. 08 Oct, 2024 3 commits
    • Add a gpu gemm reference kernel (#1528) · aa932445
      Rostyslav Geyyer authored
      
      
      * Add a gpu gemm reference kernel
      
      * Switch to gpu reference in gemm examples
      
      * Remove redundant arguments
      
      * Update all related examples
      
      * Update more examples
      
      * Try fewer threads per block
      
      * Try even fewer threads per block
      
      * Add support for all matrix layouts
      
      * Increase block size
      
      * Clean up
      
      * Remove hardcoded strides
      
      * Clean up
      
      * Try a column-major case
      
      * Revert back to row-major
      
      * Run both CPU and GPU verification
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
    • [CK_TILE] Update example README files & fix script compatibility issue (#1548) · 0c094daa
      Po Yen Chen authored
      * Fix text alignment of ArgParser::print()
      
      * Update example README files
      
      * Clarify make-ck-dev.sh <arch> usage
      
      * Only keep some of the arguments from '-?' output
      
      * Undo command line output changes in README
      
      * Only keep existing arguments in the docs and update descriptions
      
      * Fix text alignment
      
      * Make cmake-ck-*.sh compatible with 'sh' command
    • [CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549) · 74d68e3b
      Qianfeng authored
      
      
      * Simplify the codes in splitkv_combine pipeline
      
      * Always set kPadSeqLenK=true for fmha splitkv kernels
      
      * Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes
      
      ---------
      Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
  13. 07 Oct, 2024 4 commits
  14. 04 Oct, 2024 3 commits
  15. 02 Oct, 2024 2 commits
  16. 01 Oct, 2024 1 commit