Commits · cde0f4800596e4045a8365591f68d6449dea3703 · gaoqiong / composable_kernel_ROCM

26 Nov, 2024 1 commit
- Enable gemm kernel on all gfx9 architectures (#227) · 1084c64c
  Andriy Roshchenko authored Nov 25, 2024
  
  1084c64c
21 Nov, 2024 2 commits
- Narrowing the scope of PR to OCP FP8 enablement only · 3289a5c9
  Andriy Roshchenko authored Nov 21, 2024
  
  3289a5c9
- Cleanup · 97bad9f9
  Andriy Roshchenko authored Nov 21, 2024
  
  97bad9f9
20 Nov, 2024 3 commits
- Update year in copyright notice. · 8993448c
  Andriy Roshchenko authored Nov 20, 2024
  
  8993448c
- Fix GPU verification reporting logic. · 41b94703
  Andriy Roshchenko authored Nov 20, 2024
  
  41b94703
- Make fail/pass logic consistent within 01_gemm folder · 25c6d97b
  Andriy Roshchenko authored Nov 20, 2024
```
Removed multiple negations in fail/pass logic to propagate `true` as the success indicator.
```
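The convention described in this commit body (a single boolean where `true` means success, with no chained negations) can be sketched as follows. This is a minimal illustration only: `verify`, `run_example`, and `exit_code` are hypothetical names, not functions from the 01_gemm examples, and the tolerance is an assumed value.

```python
def verify(result, reference, tol=1e-3):
    """Return True when every element matches within tol (True means success)."""
    return len(result) == len(reference) and all(
        abs(r - e) <= tol for r, e in zip(result, reference)
    )


def run_example(result, reference):
    """Propagate the success flag unchanged: no 'not' on the way up."""
    passed = verify(result, reference)
    print("PASS" if passed else "FAIL")
    return passed


def exit_code(passed):
    """Convert to a process exit status in exactly one place (0 means success)."""
    return 0 if passed else 1
```

Keeping the inversion to a single spot at the process boundary is the point of the sketch: every intermediate layer can be read as "true is good".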
  25c6d97b
19 Nov, 2024 1 commit
- Address formatting issues and leftovers · 728032d7
  Andriy Roshchenko authored Nov 19, 2024
  
  728032d7
14 Nov, 2024 2 commits

Fix gfx1101 build · 8209d54c
Andriy Roshchenko authored Nov 14, 2024

8209d54c

Fix example_convnd_fwd_max_xdl_int8 failures on MI300 (#1666) · d805a461

Andriy Roshchenko authored Nov 14, 2024

* Improve test verbosity.

* BUGFIX: Add missing initialization for reduction buffer

* Change default initialization method

Performance may be affected for fp32 and int8 examples.

* Improve test verbosity

* Cleanup

d805a461

12 Nov, 2024 1 commit

[CK Tile] Improve the Layout, Padding, and Alignment features of CK Tile GEMM (#1651) · 2b6458dd

Thomas Ning authored Nov 12, 2024

* Finished the feature

* Modified the test file

* Test case update

* address comment

* Addressed the review comment

* Fixed the CI error

2b6458dd

11 Nov, 2024 1 commit

[CK_TILE] add more stride for layernorm to support un-continuous Tensor (#1650) · 8ef8a994

valarLip authored Nov 11, 2024

* [CK_TILE] add more stride for layernorm to support un-continuous Tensor

* align CK coding style

* extend strides to layernorm example

* clang-format...

8ef8a994

09 Nov, 2024 2 commits

Ck tile/moe sorting (#1624) · bec6fbc6

dummycoderfe authored Nov 09, 2024



* add moe_sorting & check ok

* fix comments & typo

* Run remod.py under include/ck_tile & example/ck_tile directories

* format codes

* fix output ci check bug

* fix moe sorting readme and error commit file

* use magic div to accelerate compute

* add a loop unroll for moe lds ops

* add extblocksnel to set zeros for moebufs

* [Ck_tile] moe set zero run ok, add size check and fix ref check

* [Ck_tile]fix moe_sorting fuse set_zero remod

* [Ck_tile] change name style, fix zero buffer size err, change folder

* [Ck_tile] moe_sorting: fix name style

* [Ck_tile] moe_sorting, remove useless params in traits

* [Ck_tile] change outputtile cnt * unit_size; change output buf alloc
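
The list above mentions using magic division to accelerate compute. As a generic illustration of that technique (not the code from this commit), the sketch below replaces an unsigned 32-bit division by a fixed divisor with a multiply, an add, and a shift. The function names are hypothetical, and the multiplier/shift formula is the standard round-up construction, assumed here rather than taken from the source.

```python
def magic_numbers(divisor):
    """Precompute (multiplier, shift) for dividing 32-bit unsigned values by divisor."""
    if not 1 <= divisor <= 2**31:
        raise ValueError("divisor must be in [1, 2**31]")
    shift = (divisor - 1).bit_length()  # ceil(log2(divisor))
    multiplier = ((1 << 32) * ((1 << shift) - divisor)) // divisor + 1
    return multiplier, shift


def magic_div(dividend, multiplier, shift):
    """Compute dividend // divisor without a division instruction."""
    high = (dividend * multiplier) >> 32  # upper 32 bits of the product
    return (high + dividend) >> shift


multiplier, shift = magic_numbers(7)
quotient = magic_div(100, multiplier, shift)
```

The precomputation runs once per divisor on the host, so a kernel that divides many indices by the same value only pays for the multiply and shift.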

---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

bec6fbc6

Fix 'sh' command compatibility of smoke_test_fwd.sh (#1553) · af9546d9
Po Yen Chen authored Nov 09, 2024

af9546d9

08 Nov, 2024 3 commits

Fix data types and improve testing verbosity. · 61b20afa
Andriy Roshchenko authored Nov 08, 2024

61b20afa
Verify more tests on floating point data · 51b9abb9
Andriy Roshchenko authored Nov 08, 2024

51b9abb9

[Ck tile] layernorm2d fwd optimize (#1637) · 686a58a9

dummycoderfe authored Nov 08, 2024



* optimize small N case using vec io and using rcp div

* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass

* [Ck_tile] fix blockSize compute in Generic2dBlockShape

* [Ck_tile]fix kfastfdiv template style

* [Ck_tile] layernorm, fix style in review

---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>

686a58a9

07 Nov, 2024 6 commits
- Verify 38_grouped_conv_bwd_data_multiple_d on floating point numbers · 646b8e5c
  Andriy Roshchenko authored Nov 07, 2024
  
  646b8e5c
- enable compilation for generic navi targets (#1645) · 75c5bfa3
  Illia Silin authored Nov 07, 2024
  
  75c5bfa3
- Verify 20_grouped_conv_bwd_weight on floating point numbers · 405fdaec
  Andriy Roshchenko authored Nov 07, 2024
  
  405fdaec
- Verify 04_gemm_add_add_fastgelu on floating point numbers · ff6bbf40
  Andriy Roshchenko authored Nov 07, 2024
  
  ff6bbf40
- Verify 35_splitk_gemm on floating point numbers. · 52cd7ade
  Andriy Roshchenko authored Nov 07, 2024
```
splitk gemm appears to lose precision versus the reference implementation when floating-point numbers are involved.
```
  52cd7ade
- Introduce two new tensor generators · 1fb3bb8d
  Andriy Roshchenko authored Nov 07, 2024
  
  1fb3bb8d
06 Nov, 2024 2 commits
- Make sure all tests and examples are built for gfx950 · 2eb1ba44
  Andriy Roshchenko authored Nov 06, 2024
  
  2eb1ba44
- Change default verification method to CPU. · 360dd17a
  Andriy Roshchenko authored Nov 06, 2024
```
GPU verification takes too much time to complete on the emulator.
```
  360dd17a
05 Nov, 2024 4 commits

Prevent instantiation of undefined FP8 operators. (#1639) · 365f39ae
Andriy Roshchenko authored Nov 05, 2024

365f39ae
Fix test success reporting logic · 7b8e2cf6
Andriy Roshchenko authored Nov 05, 2024

7b8e2cf6

Make sure cmake can handle the xnack+/xnack- targets. (#1633) · b6e74be1

Illia Silin authored Nov 05, 2024

* make sure cmake can handle xnack targets

* don't build xdl instances for gfx906:xnack-

* don't build xdl tests for gfx906:xnack-

b6e74be1

[generate.py] Override blob list if it already exists (#1635) · 464abd23

Juan Manuel Martinez Caamaño authored Nov 05, 2024

Before, generate.py appended the list at the end of the output file.
When running the cmake configuration steps multiple times on the
examples, the blob list (such as fwd_blob_list.txt) would grow at every
configuration.
`library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around
this issue by removing the output file if it exists.

Now, generate.py overrides the content of the output file.
There is no need for the workaround in the CMakeLists.txt;
and the issue is solved for the example projects too.
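
The difference between the old and new behavior can be sketched with a small simulation. This is an illustration under assumptions, not the actual generate.py code: `write_blob_list` and `simulate` are hypothetical helpers, and the blob names are placeholders. Only the file name `fwd_blob_list.txt` comes from the commit message.

```python
import os
import tempfile


def write_blob_list(path, blobs, overwrite=True):
    """Write one blob name per line; 'a' mimics the old append, 'w' the new override."""
    mode = "w" if overwrite else "a"
    with open(path, mode) as f:
        for name in blobs:
            f.write(name + "\n")


def simulate(overwrite, runs=3):
    """Run the configuration step several times and return the resulting line count."""
    blobs = ["kernel_a.cpp", "kernel_b.cpp"]  # placeholder names
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fwd_blob_list.txt")
        for _ in range(runs):
            write_blob_list(path, blobs, overwrite=overwrite)
        with open(path) as f:
            return sum(1 for _ in f)
```

With append mode the list grows on every configuration run, while write mode keeps it at the size of a single run, which is why no separate "remove the file first" step is needed.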

464abd23

04 Nov, 2024 1 commit
- Prevent instantiation of operators that are not supported by FP8 data types · 1ccb8112
  Andriy Roshchenko authored Nov 04, 2024
  
  1ccb8112
02 Nov, 2024 1 commit

[CK_TILE] layernorm have more accurate residual (#1623) · cb6c5d39

carlushuang authored Nov 02, 2024



* more accurate residual

* modify comment

* Fix literal case in README.md

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

cb6c5d39

01 Nov, 2024 2 commits

[Ck_tile] smoothquant (#1617) · fbd65454

rocking authored Nov 01, 2024



* fix compile error

* fix typo of padding

* Add smoothquant op

* Add smoothquant instance library

* refine type

* add test script

* Re-generate smoothquant.hpp

* Always use 'current year' in copyright

* use Generic2dBlockShape instead

* Add vector = 8 instance back

* Find exe path automatically

* Simplify the api condition

* Remove debugging code

* update year

* Add blank line between function declaration

* explicitly cast return value to dim3

* refine return value

* Fix default warmup and repeat value

* Add comment

* refactor smoothquant cmake

* Add README

* Fix typo

---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>

fbd65454

[layernorm] hot fix (#1620) · 550248de
carlushuang authored Nov 01, 2024
```
* hot fix ln

* some rename
```
550248de

31 Oct, 2024 2 commits

Bugfixes on gfx1101 architecture. · 6e6a3bc6
Andriy Roshchenko authored Oct 31, 2024

6e6a3bc6

[CK_TILE] layernorm support fused-quant/fused-add (#1604) · c3a4800c

carlushuang authored Oct 31, 2024

* add prenorm/postnorm support, refactor using generate.py

* update README

* update README

* fix format

* update some description and fix format

* update format

* format

* use non-raw for loading

* format and update n4096

* dynamic-quant ready

* update readme

* support fused dynamic-quant

* update fused-quant, with smooth

* update README

* update args

* update some based on comment

c3a4800c

30 Oct, 2024 3 commits

[CK-Tile] Universal gemm memory bound pipeline (#1558) · 24d996aa

Adam Osewski authored Oct 30, 2024

* CK-Tile GEMM with memory bound pipeline.

* Memory bound gemm pipeline.

* Fix not closed namespace.

* Block gemm mem pipeline draft.

* Do not use ck_tile:: within ck_tile namespace.

* Refactoring & Move Layout info to pipeline problem.

* Get hot loop and TailNum information before launching kernel.

* Fixes in pipeline.

* Add comment to load_tile_raw and change variable naming style.

* Few small changes & formatting.

* Do not use macro.

* Add gtests.

* Use AccDataType for Output of MFMA instruction.

* Formatting.

* Refactor gemm examples.

* Switch over to current block gemm.

* Use currently available pipeline policy.

* Refactoring and review comments.

* Fixes after merge.

* Add missing include.

* Add load tile overload which accepts output tensor as parameter.

* This gives an 8% perf boost at the cost of using more registers.

* Rename example.

* Small changes.

* Fix compilation err and lower K.

* Support different layouts for A/B

* Fix vector size for different layouts.

* Rename Alignment into VectorSize

* Unblock tests.

24d996aa

[Ck tile] support rmsnorm and related fusion (#1605) · 3d609534

rocking authored Oct 30, 2024

* Add reduce2d new api

* Prevent users from using cross-warp reduction

* Fix bug of std calculation

* Add rmsnorm2d

* Add rmsnorm small example

* Remove static assert to prevent compile fail

* Add script to test performance and correctness

* Add missing cmake change

* refine naming

* refine example of rmsnorm

* Fix bug of rmsnorm

* Refine naming

* Fix cmake

* clang format

* Refine pipeline name

* Add add_rmsnorm2d_rdquant kernel

* Add reduce op

* host verification

* Fix bug of one pass pipeline

* Refine tile size

* Add two pass pipeline

* Rename two pass to three pass

* Fix bug of kSaveX == false

* Add instance library

* Add test script

* Fix bug of x verification

* Add save_x to trait

* Add README

* Move reduce2d into reduce folder

* Fix bug of welford when number of m warp > 1

* remove redundant comment

* 1. move 06_rmsnorm2d to 10_rmsnorm2d
2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant

* clang format and add missing header

* Add host validation of add + layernorm2d + rsquant

* Revert "Add host validation of add + layernorm2d + rsquant"

This reverts commit 936cb457978b928b90eff89a08fcdb7dc8bbed67.

* Remove deprecated flag

3d609534

[CK_TILE] Add fmha fwd headdim96 support (#1608) · 86322218

Qianfeng authored Oct 30, 2024



* Add ceil_to_qualified_tile_length()

* Rename kK0BlockLength to kQKHeaddim

* Add kSubQKHeaddim concept to support headdim96

* Fix in math.hpp to avoid using __half interfaces

* Add LdsBufferSequence instance for headdim96

* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing

* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time

* Reformat one file

* Fix text alignment in fmha_fwd_splitkv.py

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

86322218

29 Oct, 2024 2 commits
- [CK_TILE] add generic_permute (#1607) · 9fbd72e9
  valarLip authored Oct 29, 2024
  
  9fbd72e9
- Extend GeneratorTensor_Sequential to produce values of prescribed data types. · b206fb26
  Andriy Roshchenko authored Oct 29, 2024
  
  b206fb26
26 Oct, 2024 1 commit

topk_softmax (#1592) · b098b71b

carlushuang authored Oct 26, 2024

* topk_softmax

* remove some file

* fix atomic linear_offset

* address various comments, and change sfc get_index api to static(tuple)

b098b71b