1. 16 Aug, 2024 3 commits
    • Re-enable fp8 types for all architectures. (#1470) · c8b6b642
      Illia Silin authored
      * re-enable fp8 and bf8 for all targets
      
      * restore the fp8 gemm instances
      
      * re-enable conv_3d fp8 on all architectures
      
      * disable several fp8 gemm instances on all architectures except gfx94
      
      * clang format fix
    • [CK_TILE] FA bwd kernels optimization (#1397) · 79a5d9c1
      Dan Yao authored
      
      
      * tmp save
      
      * fix batch deterministic bugs
      
      * fix group deterministic bugs
      
      * codegen update
      
      * reorder files
      
      * bias support
      
      * hd256 bias support
      
      * bwd smoke test update
      
      * simplify convert dq
      
      * fix hd256 dropout scratch
      
      * do{}while() -> while(){}
      
      * comments
      
      * remove FmhaBwdTilePartitioner
      
      * save clear_tile
      
      * refactor dropout
      
      * code cleanup
      
      * code cleanup
      
      * comments
      
      * fix epilogue problem
      
      * fix fwd dropout
      
      * group convert_dq opt
      
      * fix dq alignment
      
      * Do not store storerandval in bwd for flash attention integration
      
      * fix hd32 error and boost performance
      
      * revert
      
      * Remove duplicated WarpGemm definitions in the policy file
      
      * dropout patch for mrepeat 16*16
      
      * code sync up
      
      * dq_acc stride
      
      * dq_acc stride stuff
      
      * codegen update
      
      * fwd dropout revert
      
      * fix hd128 scratches and boost performance
      
      * receipt 3 for simplified smoke test
      
      * more strides for fa integration
      
      * fix hd64 scratches and boost performance
      
      * non-iglp pipeline for headdim padding cases
      
      * dpad same as dvpad for flash attention integration
      
      * unpadded lse&d for group mode
      
      * Support unpad layout for group lse
      
      * Support unpad lse layout for splitkv
      
      * Fix stride for splitkv kernel
      
      * fix unpadded lse issue in fwd splitkv
      
      * comment
      
      * solve lds read&write conflicts
      
      * rename
      
      * bias rename
      
      * tile index revert
      
      ---------
      
      Co-authored-by: danyao12 <danyao12>
      Co-authored-by: rocking <ChunYu.Lai@amd.com>
      Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
    • Add performance and large tensor tests for grouped conv (#1456) · 2581727d
      Bartłomiej Kocot authored
      
      
      * Add performance and large tensor tests for grouped conv
      
      * Resize tests
      
      * Resize tests
      
      * update the python script to parse the grouped_conv results
      
      * Remove int8 tests
      
      * change bwd wei layout
      
      ---------
      Co-authored-by: illsilin <Illia.Silin@amd.com>
  2. 15 Aug, 2024 2 commits
  3. 14 Aug, 2024 1 commit
    • [GEMM] gemm_universal related optimization (#1453) · 3049b546
      Haocong WANG authored
      
      
      * replace buffer_atomic with global_atomic
      
      * fixed global_atomic_add
      
      * added bf16 atomic_add
      
      * format
      
      * clang-format-12
      
      * clean
      
      * clean
      
      * add guards
      
      * Update gtest.cmake
      
      * enabled splitk_gemm_multi_d
      
      * format
      
      * add ckProfiler
      
      * format
      
      * fixed naming
      
      * format
      
      * clean
      
      * clean
      
      * add guards
      
      * fix clang format
      
      * format
      
      * add kbatch printout
      
      * clean
      
      * Add rocm6.2 related gemm optimization
      
      * Limit bf16 atomic usage
      
      * remove redundant RCR gemm_universal instance
      
      * Add RRR fp8 gemm universal instance
      
      * Bug fix
      
      * Add GPU_TARGET guard to FP8/BF8 target
      
      * bug fix
      
      * update cmake
      
      * remove all fp8/bf8 example if arch not support
      
      * Enable fp8 RRR support in ckProfiler
      
      * limit greedy-reverse flag to gemm_universal in ckProfiler
      
      ---------
      Co-authored-by: Jing Zhang <jizhan@fb.com>
      Co-authored-by: Jing Zhang <jizhan@meta.com>
      Co-authored-by: zjing14 <zhangjing14@gmail.com>
      Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
      Co-authored-by: illsilin <Illia.Silin@amd.com>
  4. 13 Aug, 2024 2 commits
    • Fix compilation errors with libc++ (#1461) · 50c42348
      AngryLoki authored
      
      
      This fixes two issues that show up when compiling with libc++.
      
      The first issue is an attempt to call std::numeric_limits<ranges::range_value_t<_Float16>>::min().
      Support for _Float16 here is a libstdc++ extension and is not part of the C++ standard[2].
      Fortunately, composable_kernel already provides a NumericLimits class that does everything needed.
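      A minimal, self-contained sketch of the idea behind this first fix, using assumed names (half_t, NumericLimitsSketch) rather than composable_kernel's actual NumericLimits implementation: a library-owned limits trait can be specialized for element types that std::numeric_limits does not cover on every standard library.
      ```cpp
      #include <cstdint>
      #include <iostream>
      #include <limits>

      // Stand-in for a 16-bit float element type; the real code uses _Float16.
      struct half_t { std::uint16_t bits; };

      // Hypothetical trait in the spirit of composable_kernel's NumericLimits.
      template <typename T>
      struct NumericLimitsSketch
      {
          // Generic types defer to the standard library.
          static constexpr T Min() { return std::numeric_limits<T>::min(); }
      };

      // Explicit specialization: no std::numeric_limits<half_t> needed.
      template <>
      struct NumericLimitsSketch<half_t>
      {
          // Bit pattern of the smallest positive normal fp16 value.
          static constexpr half_t Min() { return half_t{0x0400}; }
      };

      int main()
      {
          std::cout << "float min: " << NumericLimitsSketch<float>::Min() << '\n';
          std::cout << "half min bits: 0x" << std::hex
                    << NumericLimitsSketch<half_t>::Min().bits << '\n';
      }
      ```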
      
      The second issue is that a call to 'check_err' is ambiguous: there are two candidates.
      composable_kernel relies on the assumption that f8_t (defined as _BitInt(8)) does not satisfy the is_integral trait.
      However, libc++ treats _BitInt(N) as integral (the standard allows "any implementation-defined extended integer types" to be integral).
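      A minimal sketch of that ambiguity, again with assumed names (check_err_sketch; the f8_t alias is local to the example) rather than the library's real check_err overload set; it needs Clang, which accepts _BitInt in C++ as an extension. When std::is_integral reports true for _BitInt(8), as the commit says libc++ does, both candidates become viable and the call is ambiguous; constraining the integral overload to exclude f8_t is one way out.
      ```cpp
      #include <iostream>
      #include <type_traits>

      using f8_t = _BitInt(8); // mirrors the definition described in the commit message

      // "Integer" candidate.
      template <typename T,
                std::enable_if_t<std::is_integral_v<T>, bool> = true>
      bool check_err_sketch(T, T) { return true; }

      // f8-specific candidate.
      template <typename T,
                std::enable_if_t<std::is_same_v<T, f8_t>, bool> = true>
      bool check_err_sketch(T, T) { return true; }

      int main()
      {
          std::cout << std::boolalpha
                    << "is_integral_v<f8_t> = " << std::is_integral_v<f8_t> << '\n';
          // check_err_sketch(f8_t{0}, f8_t{0});
          // ^ ambiguous whenever both constraints are satisfied (e.g. under libc++);
          //   adding !std::is_same_v<T, f8_t> to the integral overload resolves it.
          return 0;
      }
      ```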
      
      Closes: #1460
      Signed-off-by: Sv. Lockal <lockalsash@gmail.com>
    • Mateusz Ozga
  5. 12 Aug, 2024 2 commits
  6. 10 Aug, 2024 1 commit
  7. 09 Aug, 2024 2 commits
  8. 08 Aug, 2024 3 commits
  9. 07 Aug, 2024 4 commits
  10. 06 Aug, 2024 7 commits
  11. 05 Aug, 2024 2 commits
  12. 01 Aug, 2024 1 commit
  13. 31 Jul, 2024 4 commits
  14. 30 Jul, 2024 2 commits
  15. 26 Jul, 2024 2 commits
  16. 25 Jul, 2024 2 commits