Commits · a59e8d487d7eeb33ce7863ff040361515807cd7b · gaoqiong / composable_kernel

28 Jul, 2023 1 commit
- bwd qloop 2 kernels update mask · 7b915a10
  danyao12 authored Jul 28, 2023
  
26 Jul, 2023 5 commits

initial stream-k implementation with example (#699) · e7dca79d

carlushuang authored Jul 27, 2023



* initial stream-k implementation with example

* fix unexpected change in err

* improve performance a little by reorganizing the pipeline.

* improve perf a little bit by swizzling block idx

* add profiler

* update example

* fix spelling

* shrink karg for streamk

* support dynamic buffer using memory coherence glc_slc bit from template

* control memory coherence while constructing dynamic buffer

* update reduction for streamk(not ready yet)

* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting

* fix build issue

* fix several bugs

* now result is correct, everything works (but has scratch)

* remove scratch by manually reset coordinate

* update device code

* fix a bug in final reduce

* fix something in example

* update async memset

* fix enum as camel case

* modify coherence enum name

* clean code and use atomic streamk by default

* remove unused var

* throw exception if a pointer is empty

* fix format

* fix CI warning

* fix type in init

* modify CI error

* filter out on gfx10+

* restore changed example code

---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>


Disable XDL kernels on unsupported HW; add ck::is_xdl_supported (#768) · ac6d68b3

Bartłomiej Kocot authored Jul 26, 2023



* Disable XDL kernels on unsupported HW; Add ck::is_xdl_supported function (#765)

* Do not throw an error when GEMM problem is not supported.

---------
Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>


fix bugs and optimize bwd qloop 2 kernels · 35b2971e
danyao12 authored Jul 26, 2023

fix triangle name · 5af78ac2
ltqin authored Jul 26, 2023

Refine the dimension of host tensor. This example only requires 1D (#812) · 016bd428
rocking authored Jul 26, 2023


25 Jul, 2023 6 commits
- fix example call C0MatrixMask(N) · 4a653a5d
  ltqin authored Jul 25, 2023
  
- change enum to MaskUpperTringleFrom · 321b6c8e
  ltqin authored Jul 25, 2023
  
- fix name · a3f11fe9
  ltqin authored Jul 25, 2023
  
- fix example · d5f629e7
  ltqin authored Jul 25, 2023
  
- remove temporary code & files · 86717157
  danyao12 authored Jul 25, 2023
  
- Add bias scalar vectorload = 1 for gemm bias gemm (#791) · 50643dd5
  ltqin authored Jul 25, 2023
```
* first change bias load

* add bias dim and scalar vector parameter

* make CDE0BlockTransferSrcVectorDim not work

* changes to instance

* add limit for CDE0BlockTransferSrcScalarPerVector
```
24 Jul, 2023 1 commit
- add from bottom right mask · 92b9b046
  ltqin authored Jul 24, 2023
  
18 Jul, 2023 1 commit

Add mechanism to build CK for select data types, add Navi3x CI. (#790) · 189ea3b9

Illia Silin authored Jul 17, 2023

* allow building CK for specific data types

* add CI build and test stage on Navi3x without some int8 instances

* add missing gemm fp16 instances

* add the changes to the missed cmake file

* add empty lines at end of source files

* Do not build quantization client example on navi3 in CI

* disable batched_gemm_multi_d_int8 instances with DTYPES

* disable device_conv2d_bwd_data_instance with DTYPES

* fix ckprofiler for conv_bwd_data for int8

* properly isolate the conv_bwd_data int8 instances

* remove empty line


17 Jul, 2023 1 commit
- remove useless parameters · 1128cd3a
  danyao12 authored Jul 17, 2023
  
15 Jul, 2023 1 commit
- remove useless templates · 5571be9d
  danyao12 authored Jul 15, 2023
  
14 Jul, 2023 2 commits
- remove example_batched_multihead_attention_backward_v2_phased · 5ba30232
  danyao12 authored Jul 14, 2023
  
- add check_integer_err · 0f1a6b97
  danyao12 authored Jul 14, 2023
  
13 Jul, 2023 1 commit
- update copyright headers · 75fd187d
  danyao12 authored Jul 13, 2023
  
12 Jul, 2023 1 commit

Support NHWGC conv2d_bwd_weight (#769) · 1ee99dca

Bartłomiej Kocot authored Jul 12, 2023



* Support NHWGC conv2d_bwd_weight

* Fix client example

* Fix client example

* Fix comments

* Redesign grouped_conv_bwd_weight instances

* Clang format fix

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>


10 Jul, 2023 1 commit
- add padding code for M · be38f68d
  ltqin authored Jul 10, 2023
  
07 Jul, 2023 2 commits
- fix bug for group Block2CTileMap · a188073b
  ltqin authored Jul 07, 2023
  
- group remove y_grid_desc_mblock_mperblock_oblock_operblock parameter · 0b472e28
  ltqin authored Jul 07, 2023
  
06 Jul, 2023 3 commits

Batchnorm splitk single kernel (#771) · 8f5cafaf

Qianfeng authored Jul 06, 2023

* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]

* Add CountDataType as template parameter in blockwise_welford

* Add utility/get_shift.hpp

* Add BatchNorm multiblock single-kernel implementation

* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a

* Renaming in device_batchnorm_forward_impl.hpp

* Tiny fix in the batchnorm_fwd profiler

* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"

This reverts commit d16d00919c43f10759e7b4e4d112125221ed9064.

* Use the old two-kernel batchnorm multiblock method for gfx1030

* Use the old two-kernel batchnorm multiblock method for gfx908

* use the single-kernel batchnorm multiblock method only for gfx90a

* Remove get_wave_id() from utility/get_id.hpp since it is not used

* Set true for testing running mean/variance and saving mean/invvariance in the examples

* Fix copyright wording

* Remove unneeded include in utility/get_id.hpp

* Add comments to workgroup_synchronization.hpp

* Remove unused code in gridwise_multiblock_batchnorm_forward.hpp

* Renaming in the kernels

* Remove unused kernel file


Move Device Ops implementations into impl directory. (#777) · f4dfc060
Adam Osewski authored Jul 06, 2023
```
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
```
change some function names · dcfe312b
ltqin authored Jul 06, 2023


05 Jul, 2023 2 commits
- add group · a71a3f65
  ltqin authored Jul 05, 2023
  
- Add fp8 GEMM and an example for it (#767) · 1cf50031
  Rostyslav Geyyer authored Jul 04, 2023
```
* Add fp8 xdl gemm

* Add example

* Use int8 intrinsics for buffer load/store

* Format

* Update cmakelists
```
04 Jul, 2023 1 commit
- add DDattype and DKPerBlock parameter to device · 5938d555
  ltqin authored Jul 04, 2023
  
03 Jul, 2023 1 commit
- first · 2416ddf7
  ltqin authored Jul 03, 2023
  
30 Jun, 2023 1 commit
- rename device ops · 6cc7d0de
  danyao12 authored Jun 30, 2023
  
25 Jun, 2023 1 commit
- modify comment · 00cb7e41
  danyao12 authored Jun 25, 2023
  
19 Jun, 2023 4 commits

do not build gemm-gemm and conv-conv examples for gfx94* (#761) · 645eb2f2

Illia Silin authored Jun 19, 2023

* do not build gemm-gemm and conv-conv examples for gfx94*

* do not build gemm-gemm and conv-conv examples on navi


Maxpool bwd (#750) · 341ad956

rocking authored Jun 19, 2023

* Add maxpool f32 kernel and example

* Revise copyright

* Add device pool bwd device op

* Support f16 and bf16

* Add compute datatype for reference code.
Prevent error in bf16

* Fix type error

* Remove layout

* Fix bf16 error

* Add f16 and bf16 example

* Add more operations

* Implement IsSupportedArgument

* Add changelog

* Add comment

* Add comment

* Remove useless header

* Move initialization of workspace to the run

* Move set din zero to the device operator

* Save din_length_raw

* Remove useless header

* Calculate gridsize according to the number of CU

* Calculate gridSize according to the number of CU.
Remove useless header

* Add put example

* Remove useless header

* Fix CI fail


rename all fwd related files · 6d63c311
danyao12 authored Jun 19, 2023

rename all bwd related files · 656ebe9a
danyao12 authored Jun 19, 2023


16 Jun, 2023 4 commits
- rename kloop bwd files · f74fa9ec
  danyao12 authored Jun 16, 2023
  
- rename v3 to qloop_t2b_phased_pt1 · 498b7bac
  danyao12 authored Jun 16, 2023
  
- update CMakeLists · e38d2a5d
  danyao12 authored Jun 16, 2023
  
- add grouped train v2 · b4a995b8
  danyao12 authored Jun 16, 2023
  