- 27 Nov, 2024 3 commits
-
-
jakpiase authored
* add interwave scheduler for gemm mem pipeline * Fix merge artifacts. * Refactor unit tests. * Switch to interwave scheduler for mem example --------- Co-authored-by:
Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by:
Adam Osewski <Adam.Osewski@amd.com>
-
Illia Silin authored
-
Adam Osewski authored
* Few small fixes. * New GroupedGemm instances (BF16) * Unify and refactor GroupedGEMM device API. * Adapt changes to new API. * Adapt grouped gemm profiler. * Accept multiple kbatches for grouped gemm profiler. - delete obsolete two stage as it is now covered by grouped gemm * Update unit test for grouped gemm. * Fix thresholds for BF16 and F8. Unblock tests. * Fix few instances. * Multiple small fixes. * Adapt to new API, check dynamic casting. * Uncomment few data types in grouped gemm profiler. * Fix call to SetDeviceArgs. * Fix profile grouped gemm multiply tile loop. * Fix grouped gemm tile loop kernel args in client examples. * Review comments.
-
- 26 Nov, 2024 6 commits
-
-
rocking authored
* Fix cmake example build * Support max3 in smoothquant one pass * support max3 in two pass * support max3 in add_rmsnorm_rdquant
-
Adam Osewski authored
* Fix loop order. * Fix loop order in pipeline v4
-
jakpiase authored
* add check for bf16 splitk support for grouped gemm splitk * Update if condition --------- Co-authored-by:Adam Osewski <19374865+aosewski@users.noreply.github.com>
-
Po Yen Chen authored
* Allow getting batch size from splitkv tile partitioner * Fix wrong paged-kvcache impl for group mode * Fix wrong example code for page-kvcache * Undo changes in fmha_fwd.cpp * Always use 2D block table * Add is_gappy kernel argument for paged-kvcache The is_gappy argument is used for differentiating seqstart_k_ptr usage in flash-attention & xformers * Remove out-of-date comments * Remove no-longer used method * Fix wrong # page-block calculation * Fix wrong comment --------- Co-authored-by:Qianfeng <qianfeng.zhang@amd.com>
-
Adam Osewski authored
* Block universal gemm. * Universal block gemm with interwave scheduler - draft. * Refactoring * Move a/b_warp_tiles into BlockGemmImpl * set BlockGemmImpl as a class member * Change tile size for more suitable to memory bound cases. * Introduce kKPerThread to WarpGemm * Add documentation comment. * Fix Interwave scheduler block gemm. * Add compute/memory friendly tile configuration. * Clean * New tile configurations in gemm mem example. * Add more static checks and fix loop order in block gemm. * Add more static checks and use warp gemm mfma dispatcher. * Add default scheduler block gemm. * Remove logging in example.
-
carlushuang authored
* moe pipeline * update code * compile OK * update * update cpu reference * update pipeline_gemm0 * compiler ok * update pipeline * rename to ex pipeline * block-asm * update * update * update first gemm ok * compute correct * update file structure * update README * update * update * update code * update API * return unsupport case * add comment * update readme * update * uncomment * update * fix build err --------- Co-authored-by:valarLip <340077269@qq.com>
-
- 25 Nov, 2024 3 commits
-
-
Po Yen Chen authored
* Fix mis-matched tuple<> elem types * Rename MakeKargs() as MakeKargsImpl() --------- Co-authored-by:Qianfeng <qianfeng.zhang@amd.com>
-
carlushuang authored
* update MOCK_ID for moe-sorting * add moe-smoothquant * update a comment * fix format * hot fix * update topk in overflow case * update comments * update bf16 cvt --------- Co-authored-by:valarLip <340077269@qq.com>
-
Qianfeng authored
* Change in fwd-splitkv kernel to support num_splits=1 case * Update in codegen fwd-splitkv to make num_splits > 1 cases pass * Specify instance traits in dispatch * Fix link error for fp8 kernels --------- Co-authored-by:Po Yen Chen <PoYen.Chen@amd.com>
-
- 22 Nov, 2024 1 commit
-
-
schung-amd authored
* Add overloads for MakeKargs Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void*, void*> to preserve functionality of code currently passing in list initializers or tuples. * Add overloads for MakeKargs Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void*, void*> to preserve functionality of code currently passing in list initializers or tuples. * Re-format files using ck_tile remod.py --------- Co-authored-by:Po Yen Chen <PoYen.Chen@amd.com>
-
- 21 Nov, 2024 2 commits
-
-
Harisankar Sadasivan authored
* universal streamk fp8 changes & ckprofiler instances * revert strides to -1 and verification options * fp8 exclusion on pre-gfx94 for universal_streamk * PR review based revisions: permissions reverted, removed hip err checks --------- Co-authored-by:Illia Silin <98187287+illsilin@users.noreply.github.com>
-
Po Yen Chen authored
* Generate group mode paged-attn kernel * Enable paged-kvcache + group mode support * Add missing header: fused_moe.hpp * Add comment to explain kernel arg usage * Make error message more clear * Add comment for confusing data member names * Add more comment for confusing variable names * Fix typo in option description
-
- 18 Nov, 2024 2 commits
-
-
Illia Silin authored
* add bf16 gemms for gfx11/gfx12 * reduce the input values in test_gemm * add int8 wmma gemm instances for gfx11/gfx12 * add example gemm_wmma_int8 * fix bug in gemm_wmma_int8 test * increase bf16 gemm test tolerance * update the dates and clean-up commented-out instances
-
Bartłomiej Kocot authored
* Batched GEMM Multiple D based on Universal GEMM Co-authored-by:
Jing Zhang <jizhan@fb.com> * CI fixes Co-authored-by:
Jing Zhang <jizhan@fb.com> --------- Co-authored-by:
Jing Zhang <jizhan@fb.com>
-
- 14 Nov, 2024 1 commit
-
-
feli authored
Co-authored-by:dummycoderfe <noplydummmycoder@163.com>
-
- 13 Nov, 2024 3 commits
-
-
Illia Silin authored
-
Taylor Ding authored
-
Bartłomiej Kocot authored
* [CK TILE] Update gemm universal pipeline * Fixes * fix * Rebase
-
- 12 Nov, 2024 1 commit
-
-
Thomas Ning authored
* Finished the feature * Modified the test file * Test case update * addresss comment * Addressed the review comment * Fixed the CI error
-
- 11 Nov, 2024 2 commits
-
-
valarLip authored
* [CK_TILE] add more stride for layernorm to support un-continuous Tensor * align CK coding style * extend strides to layernrom expample * clang-format...
-
Po Yen Chen authored
-
- 09 Nov, 2024 1 commit
-
-
dummycoderfe authored
* add moe_sorting & check ok * fix comments & typo * Run remod.py under include/ck_tile & example/ck_tile directories * format codes * fix output ci check bug * fix moe sorting readme and error commit file * use magiv div to accelerate compute * add an loop unroll for moe lds ops * add extblocksnel to set zeros for moebufs * [Ck_tile] moe set zero run ok, add size check and fix ref check * [Ck_tile]fix moe_sorting fuse set_zero remod * [Ck_tile] change name style, fix zero buffer size err, change folder * [Ck_tile] moe_sorting: fix name style * [Ck_tile] moe_sorting, remove useless params in traits * [Ck_tile] change outputtile cnt * unit_size; change output buf alloc --------- Co-authored-by:
dummycoderfe <noplydummmycoder@163.com> Co-authored-by:
Po Yen, Chen <PoYen.Chen@amd.com> Co-authored-by:
carlushuang <carlus.huang@amd.com>
-
- 08 Nov, 2024 1 commit
-
-
dummycoderfe authored
* optimze small N case using vec io and using rcp div * [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass * [Ck_tile] fix blockSize compute in Generic2dBlockShape * [Ck_tile]fix kfastfdiv template style * [Ck_tile] layernorm, fix stype in review --------- Co-authored-by:dummycoderfe <noplydummmycoder@163.com>
-
- 07 Nov, 2024 1 commit
-
-
Illia Silin authored
-
- 05 Nov, 2024 1 commit
-
-
darren-amd authored
* explicit cast ptr offset * formating change
-
- 02 Nov, 2024 1 commit
-
-
carlushuang authored
* more accurate residual * modify comment * Fix literal case in README.md --------- Co-authored-by:Po Yen Chen <PoYen.Chen@amd.com>
-
- 01 Nov, 2024 2 commits
-
-
rocking authored
* fix compile error * fix typo of padding * Add smoothquant op * Add smoothquant instance library * refine type * add test script * Re-generate smoothquant.hpp * Always use 'current year' in copyright * use Generic2dBlockShape instead * Add vector = 8 instance back * Find exe path automatically * Simplify the api condition * Remove debugging code * update year * Add blank line between function declaration * explicitly cast return value to dim3 * refine return value * Fix default warmup and repeat value * Add comment * refactor sommthquant cmake * Add README * Fix typo --------- Co-authored-by:Po Yen, Chen <PoYen.Chen@amd.com>
-
carlushuang authored
* hot fix ln * some rename
-
- 31 Oct, 2024 1 commit
-
-
carlushuang authored
* add prenorm/postnorm support, refactor using generate.py * update README * update README * fix format * update some description and fix format * update format * format * use non-raw for loading * format and update n4096 * dynamic-quant ready * update readme * support fused dynamic-quant * update fused-quant, with smooth * update README * update args * update some based on comment
-
- 30 Oct, 2024 5 commits
-
-
Bartłomiej Kocot authored
* Remove virtual destructors from unary ops * Fixes * Fixes * clang format fixes
-
rocking authored
-
Adam Osewski authored
* CK-Tile GEMM with memory bound pipeline. * Memory bound gemm pipeline. * Fix not closed namespace. * Block gemm mem pipeline draft. * Do not use ck_tile:: within ck_tile namespace. * Refactoring & Move Layout info to pipeline problem. * Get hot loop and TailNum information before lunching kernel. * Fixes in pipeline. * Add comment to load_tile_raw and change variable naming style. * Few small changes & formatting. * Do not use macro. * Add gtests. * Use AccDataType for Output of MFMA instruction. * Formatting. * Refactor gemm examples. * Switch over to current block gemm. * Use currently available pipeline policy. * Refactoring and review comment.s * Fixes after merge. * Add missing include. * Add load tile overload which accepts output tensor as parameter. * This give 8% perf boost at the cost of using more registers. * Rename example. * Small changes. * Fix compilation err and lower K. * Support different layouts for A/B * Fix vector size for different layouts. * Rename Alignment into VectorSize * Unblock tests.
-
rocking authored
* Add reduce2d new api * Prevent user use cross warp reduction * Fix bug of std caculation * Add rmsnorm2d * Add rmsnorm small example * Remove static assert to prevent compile fail * Add script to test performance and correctness * Add missing cmake change * refine naming * refine example of rmsnorm * Fix bug of rmsnorm * Refine naming * Fix cmake * clang format * Refine pipeline name * Add add_rmsnorm2d_rdquant kernel * Add reduce op * host verification * Fix bug of one pass pipeline * Refine tile size * Add two pass pipeline * Rename two pass to three pass * Fix bug of kSaveX == false * Add instance library * Add test script * Fix bug of x verification * Add save_x to trait * Add README * Move reduce2d into reduce folder * Fix bug of welford when number of m warp > 1 * remove reduncant comment * 1. move 06_rmsnorm2d to 10_rmsnorm2d 2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant * clang format and add missing header * Add host validation of add + layernorm2d + rsquant * Revert "Add host validation of add + layernorm2d + rsquant" This reverts commit 936cb457978b928b90eff89a08fcdb7dc8bbed67. * Remove deprecated flag
-
Qianfeng authored
* Add ceil_to_qualified_tile_length() * Rename kK0BlockLength to kQKHeaddim * Add kSubQKHeaddim concept to support headdim96 * Fix in math.hpp to avoid using __half interfaces * Add LdsBufferSequence instance for headdim96 * Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing * Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time * Reformat one file * Fix text alignment in fmha_fwd_splitkv.py --------- Co-authored-by:Po Yen Chen <PoYen.Chen@amd.com>
-
- 29 Oct, 2024 3 commits
-
-
valarLip authored
-
valarLip authored
-
Illia Silin authored
-