Commits · 2d9017d16d764c5c54e9547b1534cdcd0701bc9d · gaoqiong / composable_kernel_ROCM

21 Oct, 2024 11 commits
- Add reduce example · 2d9017d1
  rocking authored Oct 21, 2024
  
  2d9017d1
- fix 2p pass · d7110645
  carlushuang authored Oct 21, 2024
  
  d7110645
- Use smaller N · d5d7de90
  rocking authored Oct 21, 2024
  
  d5d7de90
- parallel compilation · 2bc76625
  rocking authored Oct 21, 2024
  
  2bc76625
- Fix incorrect type · cbe1e304
  rocking authored Oct 21, 2024
  
  cbe1e304
- format · 0440f8bd
  letaoqin authored Oct 21, 2024
  
  0440f8bd
- Add two pass pipeline · 93ec1681
  rocking authored Oct 21, 2024
  
  93ec1681
- add n1536 · 4cef2fc5
  carlushuang authored Oct 21, 2024
  
  4cef2fc5
- support up to 2048 now · d56b41fd
  carlushuang authored Oct 21, 2024
  
  d56b41fd
- update instance for 2048, 1024 case · a30bd5da
  carlushuang authored Oct 21, 2024
  
  a30bd5da
- [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577) · 95e722a3
  Po Yen Chen authored Oct 21, 2024
```
* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged
```
  95e722a3
20 Oct, 2024 3 commits
- add more block-per-tile instance · 9d13f91b
  carlushuang authored Oct 20, 2024
  
  9d13f91b
- rename more · 1cb3e443
  carlushuang authored Oct 20, 2024
  
  1cb3e443
- refactor layernorm2d pipeline and add block-per-block utility · 5cfd751b
  carlushuang authored Oct 20, 2024
  
  5cfd751b
17 Oct, 2024 4 commits
- remove redundant line · 68e67701
  rocking authored Oct 17, 2024
  
  68e67701
- Refine file name · 1c870158
  rocking authored Oct 17, 2024
  
  1c870158
- Refine variable name · 3b290001
  rocking authored Oct 17, 2024
  
  3b290001
- clang format · d5efa5e5
  rocking authored Oct 17, 2024
  
  d5efa5e5
16 Oct, 2024 10 commits
- Add kMThreadPerBlock to template parameter · 98395085
  rocking authored Oct 16, 2024
  
  98395085
- Refine arg of operator() · 03247367
  rocking authored Oct 16, 2024
  
  03247367
- Extract common function · 5c736bc1
  rocking authored Oct 16, 2024
  
  5c736bc1
- Support bf16 · b894487c
  rocking authored Oct 16, 2024
  
  b894487c
- refactor api · 4e14a894
  rocking authored Oct 16, 2024
  
  4e14a894
- Remove fp32 instances · 629257f9
  rocking authored Oct 16, 2024
  
  629257f9
- Fix order of parameter · 28f7629d
  rocking authored Oct 16, 2024
  
  28f7629d
- Remove save mean and inv std · d62f0358
  rocking authored Oct 16, 2024
  
  d62f0358
- Rename file · 29cff07e
  rocking authored Oct 16, 2024
  
  29cff07e
- unify layernorm api · abe875d6
  rocking authored Oct 16, 2024
  
  abe875d6
15 Oct, 2024 1 commit
- [CK_TILE] Add block universal gemm pipeline policy (#1557) · d02a92cc
  Bartłomiej Kocot authored Oct 15, 2024
```
* [CK_TILE] Add block universal gemm pipeline policy

* Fixes

* fixes2

* Fixes3

* fixeS
```
  d02a92cc
14 Oct, 2024 2 commits
- Add transpose scale amax example (#1547) · f21cda25
  Bartłomiej Kocot authored Oct 14, 2024
```
* Add transpose scale amax example

* fixes

* Tune reduce instance
```
  f21cda25
- 1. Add save mean and save std back · 96568141
  rocking authored Oct 14, 2024
```
2. Move construction of tensor_view and tile_window to operator()
```
  96568141
12 Oct, 2024 2 commits
- Update warpshuffle · e0b473b6
  aska-0096 authored Oct 12, 2024
  
  e0b473b6
- port layernorm · 63214d01
  letaoqin authored Oct 12, 2024
  
  63214d01
10 Oct, 2024 2 commits

Fix default stride value (#1559) · d18fc079
Rostyslav Geyyer authored Oct 10, 2024

d18fc079

Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535) · 6f27bc98

Thomas Ning authored Oct 10, 2024



* Make the cshuffle compilable

* modify the reference on gpu and cpu. Correct access of cshuffle

* fix the cpu reference code

* Complete the in tile shuffle logic

* restructure the kernel template input

* change the naming pattern of ck_tile gemm pipeline

* Re-format files using remod.py

* Solve the fmha conflict with gemm

* Comment Addressed from Carlus

---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>

6f27bc98

08 Oct, 2024 3 commits

Add a gpu gemm reference kernel (#1528) · aa932445

Rostyslav Geyyer authored Oct 08, 2024



* Add a gpu gemm reference kernel

* Switch to gpu reference in gemm examples

* Remove redundant arguments

* Update all related examples

* Update more examples

* Try fewer threads per block

* Try even fewer threads per block

* Add support for all matrix layouts

* Increase block size

* Clean up

* Remove hardcoded strides

* Clean up

* Try a column-major case

* Revert back to row-major

* Run both CPU and GPU verification

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

aa932445

[CK_TILE] Update example README files & fix script compatibility issue (#1548) · 0c094daa

Po Yen Chen authored Oct 08, 2024

* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the arguments from '-?' output

* Undo command line output changes in README

* Only keep existing argument on doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command

0c094daa

[CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549) · 74d68e3b

Qianfeng authored Oct 08, 2024



* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

74d68e3b

07 Oct, 2024 2 commits

Fix build logic using GPU_ARCHS. (#1536) · 7d8ea5f0

Illia Silin authored Oct 07, 2024

* update build logic with GPU_ARCHS

* fix the GPU_ARCHS build for codegen

* unset GPU_TARGETS when GPU_ARCHS are set

7d8ea5f0

[Ck tile] Support layernorm one pass (#1512) · 0023f01a

rocking authored Oct 07, 2024



* Fix compile error

* Add one pass pipeline

* Extract creating tile_window to operator()

* clang format

* reduce duplicated code

* do not hardcode

* Support padding in layernorm

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

0023f01a