- 30 Jan, 2025 4 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
- 29 Jan, 2025 2 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
- 27 Jan, 2025 1 commit
-
-
Rostyslav Geyyer authored
-
- 24 Jan, 2025 2 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
- 22 Jan, 2025 3 commits
- 16 Jan, 2025 4 commits
-
-
Bartłomiej Kocot authored
* Fix and optimize dynamic unary elementwise * fix
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
carlushuang authored
* fix mock token id * prepare host for g1u1 * reformat inline-asm * restructure uk_0 * restructure gate_up * done * change default to init=1 * update readme * fix a bug in interleave pipeline * rcp for silu
-
- 15 Jan, 2025 2 commits
-
-
Bartłomiej Kocot authored
* Add rounding for float to bf16 conversion * Add bhalf test * Add inf test bhalf * Refactor * update cmake * Fixes
-
ruanjm authored
* Add shortcut to RMSNorm * Modify test for adding shortcut for RMSNorm * Add fused parameter into tests * 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp * 1. Supports various strides and precisions. * Add support of Epilogue * Add fuse and epilogue support to rmsnorm ref * Modify rmsnorm example * Refactor tests/examples * Bug fix for newly added tests/examples * Bug fix for new tests 2 * Modify smoke test scripts; remove dbg code * Supports non-smooth dynamic quant * Update Rmsnorm2dFwd::GetName() * rename xscale and prec_sx to smoothscale and prec_sm; bug fix after rename; remove files * change example_rmsnorm2d_fwd.cpp * update performance calculator * Fix issue in two-pass when fuse add is enabled * Remove comment of beta --------- Co-authored-by: rocking <ChunYu.Lai@amd.com>
-
- 13 Jan, 2025 2 commits
-
-
Thomas Ning authored
* refactor the block_gemm_areg_breg_creg_v1 and add the v2 policy with 2x2 warp gemm * Finished the 2x2 warp gemm policy and the block selection mechanism * Clang format * address poyen's comment * Address feedback * Fixed the compilation issue * Change the function name
-
Qianfeng authored
* Update for fmha_fwd qs_ks_vs pipeline * Remove __builtin_amdgcn_sched_barrier(0) * Move p_compute to p converting earlier to try to increase VGPR reuse * Enable GetQKBlockGemm to use WarpGemm-16x16x16 for QLoadOnce==false situation * Re-add __builtin_amdgcn_sched_barrier(0) --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
-
- 10 Jan, 2025 1 commit
-
-
Bartłomiej Kocot authored
* Grouped convolution backward weight special vector size loads * Instances and tests * Fixes * Add 7 and 13 special cases * fix comments * Fix * Fix2 * fixes * fix atomic add bf16
-
- 08 Jan, 2025 12 commits
-
-
darren-amd authored
* Disable building DPP kernels by default * Disable building dpp instances, examples, or tests if DPP_KERNELS is not set * Add new DPP_KERNELS flag to readme
-
Max Podkorytov authored
-
Max Podkorytov authored
-
Max Podkorytov authored
-
Max Podkorytov authored
-
Max Podkorytov authored
-
Max Podkorytov authored
-
Max Podkorytov authored
-
Max Podkorytov authored
-
Max Podkorytov authored
-
Max Podkorytov authored
-
AMD-dteng authored
* 1. enable bias feature that adds bias before adding residual; 2. change block size from 128->64 when m<64 in fp16 * delete comment * 1. remove fmha change 2. change buffer name from bias to xbias * Now bias can be used independently from fadd * change kbias to kxbias --------- Co-authored-by: feli <felix.li@amd.com>
-
- 07 Jan, 2025 2 commits
-
-
Andriy Roshchenko authored
* Move scaled_type_convert functions to a separate header * Introduce MX data tests * Build MX tests only on relevant architectures * Refactor E8M0 scale implementation * Fix `config.h` typo * Cleanup deprecated symbols * Refactor `amd_ck_fp8.hpp` * `scaled_type_convert` for `f8_ocp_t` * Implement test for MX FP8 scaled type convert * Implement test for MX BF8 scaled type convert * Scaled type convert for vectors of 2 FP8 elements * Scaled type convert for vectors of 16 FP8 elements * Implementation of scaled conversion from F32 to F8 * Add tests for scaled conversions from FP32 to FP8 * Add documentation to the test functions * Implementation of scaled conversion from F32x2 to F8x2 * Implementation of scaled conversion from F32x16 to F8x16 * Implementation of scaled conversion from F32x32 to F8x32 * Implementation of scaled conversion from F8x32 to F32x32 * Verified on the emulator
-
Po Yen Chen authored
* Update license year * Add initial code to override decode problem * Fix splitkv traits/args overriding error * Reshape and transpose lse for decode * Remove debug code * Prettify example code * Use better function name * Add kMergeNumHeadGroupsSeqLenQ flag: kernel users can use this switch to turn the optimization on or off for some problem sizes * Add missing flag declarations * Turn off kMergeNumHeadGroupsSeqLenQ by default in codegen * Group similar statements together * Remove assumption of seqlen_q=1 * Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel * Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel * Run kMergeNumHeadGroupsSeqLenQ=true kernels when needed * Fix group mode block skip logic * Undo changes of normal fwd kernel * Update in GridSize() and using GridSize() for splitkv kernel (#1799) --------- Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
-
- 06 Jan, 2025 1 commit
-
-
Rostyslav Geyyer authored
* Add conversions * Add tests * Add docstrings * Add scaled conversions * Add fp6/bf6 tests * Remove misleading fp4 test case * Add docstrings * Clean up * Address comments * Set stricter tolerances for RNE tests * Add missing tests * Add native conversions to float * Revert "Add native conversions to float" This reverts commit 09467111f73b753c8cc3d597533b187940353dab. * Update copyright years
-
- 04 Jan, 2025 2 commits
-
-
Bartłomiej Kocot authored
* Fix universal gemm profiler for pk_i4_t * fix
-
Illia Silin authored
-
- 03 Jan, 2025 2 commits
-
-
carlushuang authored
* quant * fix bug * simple smoothquant after softmax * update kv-quant * update stride * fix fp8-pertoken-kvcache * update int8/fp8 quant support --------- Co-authored-by: so <a.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
-
Mingtao Gu authored
* enable int4 scale (weight only) kernel * format some files * Add unit test for int4 weight only * fixed and formatted code * fixed * formatted * formatted * fixed * fixed a bug in the ckProfiler, and formatted the code --------- Co-authored-by: mtgu0705 <mtgu@amd.com>
-