Commits · 883d9c6c7593bf2a83f5e1d51719ada26e560426 · gaoqiong / composable_kernel_ROCM

21 Nov, 2024 3 commits
- Fix linker error · 883d9c6c
  Rostyslav Geyyer authored Nov 21, 2024
  
  883d9c6c
- Fix · c7345e52
  Rostyslav Geyyer authored Nov 21, 2024
  
  c7345e52
- Add vector conversions · a8cd34d6
  Rostyslav Geyyer authored Nov 21, 2024
  
  a8cd34d6
19 Nov, 2024 1 commit
- Fix · b5ac2abd
  Rostyslav Geyyer authored Nov 19, 2024
  
  b5ac2abd
18 Nov, 2024 2 commits
- Add tensor generators · 1475cb44
  Rostyslav Geyyer authored Nov 18, 2024
  
  1475cb44
- Add check_err function · 35a32da2
  Rostyslav Geyyer authored Nov 18, 2024
  
  35a32da2
15 Nov, 2024 1 commit
- Add vector types and tests · 37072aac
  Rostyslav Geyyer authored Nov 15, 2024
  
  37072aac
08 Nov, 2024 2 commits
- Add debug tests · 1bc375e9
  Rostyslav Geyyer authored Nov 08, 2024
  
  1bc375e9
- Add fp4 vectors · aa1920da
  Rostyslav Geyyer authored Nov 08, 2024
  
  aa1920da
07 Nov, 2024 2 commits
- Format · 9433306a
  Rostyslav Geyyer authored Nov 07, 2024
  
  9433306a
- Merge branch 'gfx950' into lwpck-2390 · 630042d8
  Rostyslav Geyyer authored Nov 07, 2024
  
  630042d8
06 Nov, 2024 6 commits
- Add device conversions · 5f1a24a8
  Rostyslav Geyyer authored Nov 06, 2024
  
  5f1a24a8
- Add scaled conversions with tests · 1bca7134
  Rostyslav Geyyer authored Nov 06, 2024
  
  1bca7134
- Add scale <-> float conversions · 0bb6e25f
  Rostyslav Geyyer authored Nov 06, 2024
  
  0bb6e25f
- sync from public repo · 261d76c4
  illsilin authored Nov 06, 2024
  
  261d76c4
- sync from public repo · a4522ae3
  illsilin authored Nov 06, 2024
  
  a4522ae3
- Merge pull request #214 from ROCm/merge_from_public · e0594d08
  Illia Silin authored Nov 06, 2024
```
Merge from public
```
  e0594d08
05 Nov, 2024 7 commits

merge from public repo · 667cd6ab
illsilin authored Nov 05, 2024

667cd6ab
Prevent instantiation of undefined FP8 operators. (#1639) · 365f39ae
Andriy Roshchenko authored Nov 05, 2024

365f39ae
remove gfx940;gfx941 from default target lists (#1640) · 54440cf5
Illia Silin authored Nov 05, 2024

54440cf5
Statically Cast Pointer Offset (#1631) · d0e3a70a
darren-amd authored Nov 05, 2024
```
* explicit cast ptr offset

* formatting change
```
d0e3a70a

Make sure cmake can handle the xnack+/xnack- targets. (#1633) · b6e74be1

Illia Silin authored Nov 05, 2024

* make sure cmake can handle xnack targets

* don't build xdl instances for gfx906:xnack-

* don't build xdl tests for gfx906:xnack-

b6e74be1

[generate.py] Override blob list if it already exists (#1635) · 464abd23

Juan Manuel Martinez Caamaño authored Nov 05, 2024

Previously, generate.py appended the list to the end of the output file.
When the cmake configuration step was run multiple times on the
examples, the blob list (such as fwd_blob_list.txt) grew with every
configuration.
`library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around
this issue by removing the output file if it exists.

Now, generate.py overwrites the content of the output file.
The workaround in the CMakeLists.txt is no longer needed,
and the issue is solved for the example projects too.
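
The difference between the old and new behaviour can be illustrated with a minimal sketch. This is not the actual generate.py code: `write_blob_list`, `simulate`, and the blob file names are hypothetical, and the sketch only assumes that the change amounts to opening the output file in write mode rather than append mode.

```python
import tempfile
from pathlib import Path


def write_blob_list(path, blobs, append=False):
    """Write one blob name per line (hypothetical helper, not from generate.py)."""
    mode = "a" if append else "w"  # "a" keeps old content, "w" truncates first
    with open(path, mode) as f:
        for blob in blobs:
            f.write(blob + "\n")


def simulate(append, runs=3):
    """Mimic several cmake configuration runs and return the final line count."""
    blobs = ["fmha_fwd_a.cpp", "fmha_fwd_b.cpp"]  # made-up blob names
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "fwd_blob_list.txt"
        for _ in range(runs):
            write_blob_list(out, blobs, append=append)
        return len(out.read_text().splitlines())


if __name__ == "__main__":
    print("append mode:   ", simulate(append=True), "lines after 3 runs")
    print("overwrite mode:", simulate(append=False), "lines after 3 runs")
```

In append mode the list grows by the number of blobs on every run, while in write mode it always holds exactly one copy, which is why the file no longer has to be removed beforehand.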

464abd23

Linsun/convint8 fwd instances (#1626) · 0c9012fb

Lin Sun authored Nov 04, 2024



Add instances for int8 grouped conv2d fwd
---------
Co-authored-by: root <root@dell300x-pla-t28-03.pla.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

0c9012fb

04 Nov, 2024 2 commits
- Temporarily disable part of dynamic op conv instances (#1630) · 4f1fdbb6
  Bartłomiej Kocot authored Nov 04, 2024
```
* Temporarily disable part of dynamic op conv instances

* fix
```
  4f1fdbb6
- Add stochastic rounding tests · 4c47048f
  Rostyslav Geyyer authored Nov 04, 2024
  
  4c47048f
02 Nov, 2024 1 commit

[CK_TILE] layernorm have more accurate residual (#1623) · cb6c5d39

carlushuang authored Nov 02, 2024



* more accurate residual

* modify comment

* Fix literal case in README.md

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

cb6c5d39

01 Nov, 2024 5 commits

Merge remote-tracking branch 'origin/develop' into gfx950 · 7da48908
Andriy Roshchenko authored Nov 01, 2024

7da48908

Reduce build time. (#1621) · 03c6448b

Illia Silin authored Oct 31, 2024

* disable fp8 gemm_universal on gfx90a and gfx908 by default

* fix cmake syntax

* fix clang format

* add ifdefs in amd_xdlops

* disable fp8 gemm instances on gfx90a by default

* update readme

03c6448b

[Ck_tile] smoothquant (#1617) · fbd65454

rocking authored Nov 01, 2024



* fix compile error

* fix typo of padding

* Add smoothquant op

* Add smoothquant instance library

* refine type

* add test script

* Re-generate smoothquant.hpp

* Always use 'current year' in copyright

* use Generic2dBlockShape instead

* Add vector = 8 instance back

* Find exe path automatically

* Simplify the api condition

* Remove debugging code

* update year

* Add blank line between function declaration

* explicitly cast return value to dim3

* refine return value

* Fix default warmup and repeat value

* Add comment

* refactor smoothquant cmake

* Add README

* Fix typo

---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>

fbd65454

[layernorm] hot fix (#1620) · 550248de
carlushuang authored Nov 01, 2024
```
* hot fix ln

* some rename
```
550248de
Merge pull request #209 from ROCm/andriy/merge_from_public · 7d50244e
Illia Silin authored Oct 31, 2024
```
Update develop branch from public repository
```
7d50244e

31 Oct, 2024 2 commits

Merge remote-tracking branch 'ck_public/develop' into andriy/merge_from_public · d51701d4
Andriy Roshchenko authored Oct 31, 2024

d51701d4

[CK_TILE] layernorm support fused-quant/fused-add (#1604) · c3a4800c

carlushuang authored Oct 31, 2024

* add prenorm/postnorm support, refactor using generate.py

* update README

* update README

* fix format

* update some description and fix format

* update format

* format

* use non-raw for loading

* format and update n4096

* dynamic-quant ready

* update readme

* support fused dynamic-quant

* update fused-quant, with smooth

* update README

* update args

* update some based on comment

c3a4800c

30 Oct, 2024 6 commits

Fix typo · b73f83fd
Rostyslav Geyyer authored Oct 30, 2024

b73f83fd
Add conversion tests · d3c89355
Rostyslav Geyyer authored Oct 30, 2024

d3c89355
Update conversions · cf7e20a8
Rostyslav Geyyer authored Oct 30, 2024

cf7e20a8
Remove virtual destructors from unary ops (#1610) · 9a8a5213
Bartłomiej Kocot authored Oct 30, 2024
```
* Remove virtual destructors from unary ops

* Fixes

* Fixes

* clang format fixes
```
9a8a5213
clang-format (#1612) · 7d911154
rocking authored Oct 30, 2024

7d911154

[CK-Tile] Universal gemm memory bound pipeline (#1558) · 24d996aa

Adam Osewski authored Oct 30, 2024

* CK-Tile GEMM with memory bound pipeline.

* Memory bound gemm pipeline.

* Fix not closed namespace.

* Block gemm mem pipeline draft.

* Do not use ck_tile:: within ck_tile namespace.

* Refactoring & Move Layout info to pipeline problem.

* Get hot loop and TailNum information before launching kernel.

* Fixes in pipeline.

* Add comment to load_tile_raw and change variable naming style.

* Few small changes & formatting.

* Do not use macro.

* Add gtests.

* Use AccDataType for Output of MFMA instruction.

* Formatting.

* Refactor gemm examples.

* Switch over to current block gemm.

* Use currently available pipeline policy.

* Refactoring and review comments.

* Fixes after merge.

* Add missing include.

* Add load tile overload which accepts output tensor as parameter.

* This gives an 8% perf boost at the cost of using more registers.

* Rename example.

* Small changes.

* Fix compilation err and lower K.

* Support different layouts for A/B

* Fix vector size for different layouts.

* Rename Alignment into VectorSize

* Unblock tests.

24d996aa