- 21 Nov, 2024 3 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
- 19 Nov, 2024 1 commit
-
-
Rostyslav Geyyer authored
-
- 18 Nov, 2024 2 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
- 15 Nov, 2024 1 commit
-
-
Rostyslav Geyyer authored
-
- 08 Nov, 2024 2 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
- 07 Nov, 2024 2 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
- 06 Nov, 2024 6 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
illsilin authored
-
illsilin authored
-
Illia Silin authored
Merge from public
-
- 05 Nov, 2024 7 commits
-
-
illsilin authored
-
Andriy Roshchenko authored
-
Illia Silin authored
-
darren-amd authored
* explicit cast ptr offset
* formatting change
-
Illia Silin authored
* make sure cmake can handle xnack targets
* don't build xdl instances for gfx906:xnack-
* don't build xdl tests for gfx906:xnack-
-
Juan Manuel Martinez Caamaño authored
Previously, generate.py appended the list to the end of the output file. When the cmake configuration step ran multiple times on the examples, the blob list (such as fwd_blob_list.txt) grew with every configuration. `library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around this issue by removing the output file if it existed. Now generate.py overwrites the content of the output file, so the workaround in that CMakeLists.txt is no longer needed, and the issue is fixed for the example projects too.
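The difference between the two behaviours can be illustrated with a minimal Python sketch. It is not the actual generate.py code: the function names, the `simulate_configure_runs` helper, and the blob names are hypothetical, and the sketch only assumes that the list is written one entry per line.

```python
import tempfile
from pathlib import Path


def write_blob_list_append(path: Path, blobs: list[str]) -> None:
    """Hypothetical stand-in for the old behaviour: append to the file."""
    with path.open("a") as f:
        for name in blobs:
            f.write(name + "\n")


def write_blob_list_overwrite(path: Path, blobs: list[str]) -> None:
    """Hypothetical stand-in for the new behaviour: truncate, then write."""
    with path.open("w") as f:
        for name in blobs:
            f.write(name + "\n")


def simulate_configure_runs(writer, blobs: list[str], runs: int) -> int:
    """Call the writer once per simulated configure step and return the
    number of lines left in the blob list afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "fwd_blob_list.txt"
        for _ in range(runs):
            writer(out, blobs)
        return len(out.read_text().splitlines())


if __name__ == "__main__":
    blobs = ["kernel_a.cpp", "kernel_b.cpp"]  # illustrative names only
    print("append:   ", simulate_configure_runs(write_blob_list_append, blobs, 3))
    print("overwrite:", simulate_configure_runs(write_blob_list_overwrite, blobs, 3))
```

With append mode the list gains a full copy of its entries on every configure run, while write mode keeps the file at a fixed size, which is why no delete-before-generate step is needed on the CMake side.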
-
Lin Sun authored
Add instances for int8 grouped conv2d fwd
---------
Co-authored-by: root <root@dell300x-pla-t28-03.pla.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
-
- 04 Nov, 2024 2 commits
-
-
Bartłomiej Kocot authored
* Temporarily disable part of dynamic op conv instances
* fix
-
Rostyslav Geyyer authored
-
- 02 Nov, 2024 1 commit
-
-
carlushuang authored
* more accurate residual
* modify comment
* Fix literal case in README.md
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
-
- 01 Nov, 2024 5 commits
-
-
Andriy Roshchenko authored
-
Illia Silin authored
* disable fp8 gemm_universal on gfx90a and gfx908 by default
* fix cmake syntax
* fix clang format
* add ifdefs in amd_xdlops
* disable fp8 gemm instances on gfx90a by default
* update readme
-
rocking authored
* fix compile error
* fix typo of padding
* Add smoothquant op
* Add smoothquant instance library
* refine type
* add test script
* Re-generate smoothquant.hpp
* Always use 'current year' in copyright
* use Generic2dBlockShape instead
* Add vector = 8 instance back
* Find exe path automatically
* Simplify the api condition
* Remove debugging code
* update year
* Add blank line between function declarations
* explicitly cast return value to dim3
* refine return value
* Fix default warmup and repeat value
* Add comment
* refactor smoothquant cmake
* Add README
* Fix typo
---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
-
carlushuang authored
* hot fix ln
* some rename
-
Illia Silin authored
Update develop branch from public repository
-
- 31 Oct, 2024 2 commits
-
-
Andriy Roshchenko authored
-
carlushuang authored
* add prenorm/postnorm support, refactor using generate.py
* update README
* update README
* fix format
* update some description and fix format
* update format
* format
* use non-raw for loading
* format and update n4096
* dynamic-quant ready
* update readme
* support fused dynamic-quant
* update fused-quant, with smooth
* update README
* update args
* update some based on comment
-
- 30 Oct, 2024 6 commits
-
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
Rostyslav Geyyer authored
-
Bartłomiej Kocot authored
* Remove virtual destructors from unary ops
* Fixes
* Fixes
* clang format fixes
-
rocking authored
-
Adam Osewski authored
* CK-Tile GEMM with memory bound pipeline.
* Memory bound gemm pipeline.
* Fix not closed namespace.
* Block gemm mem pipeline draft.
* Do not use ck_tile:: within ck_tile namespace.
* Refactoring & Move Layout info to pipeline problem.
* Get hot loop and TailNum information before launching kernel.
* Fixes in pipeline.
* Add comment to load_tile_raw and change variable naming style.
* Few small changes & formatting.
* Do not use macro.
* Add gtests.
* Use AccDataType for Output of MFMA instruction.
* Formatting.
* Refactor gemm examples.
* Switch over to current block gemm.
* Use currently available pipeline policy.
* Refactoring and review comments.
* Fixes after merge.
* Add missing include.
* Add load tile overload which accepts output tensor as parameter.
* This gives an 8% perf boost at the cost of using more registers.
* Rename example.
* Small changes.
* Fix compilation err and lower K.
* Support different layouts for A/B
* Fix vector size for different layouts.
* Rename Alignment into VectorSize
* Unblock tests.
-