Commits · ed3a2e52265e11daa366f47b082141a652b67c58 · yangql / composable_kernel-1

10 Apr, 2023 1 commit

Groupnorm + swish external api (#668) · ed3a2e52

rocking5566 authored Apr 10, 2023

* Rename to proper naming

* Add example of groupnorm + swish

* Extract duplicate code in example

* Add groupnorm + swish instances

* Refactor instance generation, split into multiple cpp files

* Add external api and client example

* Refine profiler message

* Use ck math version of exp

* Refine problem size in example

* Add host version of exp

ed3a2e52
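
As an illustration of the fused operation this commit exposes, here is a minimal host-side sketch of group normalization followed by swish, where swish(x) = x * sigmoid(x). The function names, the single-group layout, and the per-element `gamma`/`beta` vectors are assumptions made for this example; they are not the composable_kernel API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Swish activation: x * sigmoid(x), written as x / (1 + exp(-x)).
inline double swish(double x) { return x / (1.0 + std::exp(-x)); }

// Hypothetical helper (not a CK function): normalize the elements of ONE group,
// apply the affine parameters gamma/beta, then apply swish.
// Returns a zero-filled vector if the input is empty or the sizes disagree.
inline std::vector<double> groupnorm_swish_one_group(const std::vector<double>& x,
                                                     const std::vector<double>& gamma,
                                                     const std::vector<double>& beta,
                                                     double eps)
{
    const std::size_t n = x.size();
    std::vector<double> y(n, 0.0);
    if(n == 0 || gamma.size() != n || beta.size() != n)
        return y;

    // Mean over the group.
    double mean = 0.0;
    for(double v : x)
        mean += v;
    mean /= static_cast<double>(n);

    // Population variance over the group.
    double var = 0.0;
    for(double v : x)
        var += (v - mean) * (v - mean);
    var /= static_cast<double>(n);

    const double inv_std = 1.0 / std::sqrt(var + eps);
    for(std::size_t i = 0; i < n; ++i)
        y[i] = swish(gamma[i] * (x[i] - mean) * inv_std + beta[i]);
    return y;
}
```

A full group norm would call such a routine once per (sample, group) pair; the sketch leaves that outer loop out.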

13 Aug, 2022 1 commit

Layernorm welford (#346) · 0bd6b842

rocking5566 authored Aug 13, 2022



* Add threadwise and blockwise welford

* Rename gridwise op, prepare to add welford version

* implement welford and integrate welford into layernorm

* Take care of tail loop

* Fix bug when ThreadSliceK > 1

* Fix bug when merging two empty sets

* Rename clip to clamp

* 1. Fix type of count
2. Remove useless static_assert

* Do not inherit Reduction::Argument

* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait on both lgkmcnt(0) and vmcnt(0)

* Add y stride

* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm

* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

0bd6b842

30 Jun, 2022 1 commit

Standalone sweep once softmax kernel w/ ckProfiler (#295) · 93c99f3d

Anthony Chang authored Jul 01, 2022

* use 'sweep once' softmax kernel where applicable

* threadwise copy's dst buffer can specify invalid element value

* add int8 in/out float compute softmax support

give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error

* format

* softmax inherits DeviceNormalization

* softmax profiler stub

* tighten up reference softmax interface

* example prints tensor dimension

* add fp32 to softmax profiler

* rename header

* hook with ckProfiler

* format

* resolve merge conflict

* resolve merge conflicts

* update normalization profiler help string

* resolve conflict

* typo

* remove residual

* softmax profiler: address feedback

* test for mixed precision input/output

* fully qualify ck::math::isnan

* add comment for device normalization interface

* revise wording

* constness for alpha/beta scalar pointer

93c99f3d

25 Jun, 2022 2 commits

add license in file (#303) · d3051d75
Chao Liu authored Jun 24, 2022

d3051d75

Absolute include path (#281) · d1db6a0c

Chao Liu authored Jun 24, 2022

* add gelu and fast_gelu

* added GeLU and fast GeLU

* clean up

* add gemm+fastgelu example

* add gemm+gelu instances

* update profiler

* clean up

* clean up

* adding gemm+bias+activation

* clean

* adding bias

* clean

* adding gemm multiple d

* debugging

* add gemm bias add fastgelu

* rename, clean

* refactoring; add readme

* refactor

* refactor

* refactor

* refactor

* refactor

* refactor

* fix

* fix

* update example

* update example

* rename

* update example

* add ckProfiler

* clean

* clean

* clean

* clean

* add client app example

* update readme

* delete obsolete files

* remove old client app

* delete old file

* cleaning

* clean

* remove half

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path for all examples

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* fix header path

* revert client app example

* clean build

* fix build

* temporarily disable client test on Jenkins

* clean

* clean

* clean

d1db6a0c
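
For readers unfamiliar with the two activations named in this commit, the sketch below shows the exact GeLU, 0.5 * x * (1 + erf(x / sqrt(2))), next to one widely used tanh-based approximation. The log does not say which approximation the `fast_gelu` in the repository uses, so the second function is an assumption for illustration, and both function names are hypothetical.

```cpp
#include <cmath>

// Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2))).
inline double gelu_exact(double x)
{
    return 0.5 * x * (1.0 + std::erf(x / std::sqrt(2.0)));
}

// A common tanh-based approximation (assumed here; the repository's fast GeLU
// may use a different formulation):
//   0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline double gelu_tanh_approx(double x)
{
    const double sqrt_2_over_pi = 0.7978845608028654; // sqrt(2 / pi)
    const double inner          = sqrt_2_over_pi * (x + 0.044715 * x * x * x);
    return 0.5 * x * (1.0 + std::tanh(inner));
}
```

The approximation trades a small error for avoiding `erf`, which is typically the more expensive call on a GPU.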

21 Jun, 2022 1 commit

Standalone softmax kernel (#284) · 15c89e81

Anthony Chang authored Jun 22, 2022

* initial stub for standalone softmax

* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m

* host softmax validates

* compiles; to implement beta scaling

* use NaN trick to efficiently ignore OOB values during sum of exponentials

* freeload device_reduce's utility functions

* clean up interface

* adding prior value (beta scaling)

* remove restriction related to perf considerations

* apply clang-format

* clean; disable diagnostics

* resolve conflicts

* add exp wrapper

* honor HostTensorDesc interface; allow implicit cast from different vector<T> type

* test softmax for fp16/fp32

* update readme

* amend commit NaN trick

* remove redundant param added during development

* format

* replace ScalarDataType with AccDataType

* separate out test programs by precision type

* move softmax sample code to its own folder

* format

* keep up with recent changes in reduction API

* remove extra header

15c89e81
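
The "prior value (beta scaling)" item above can be illustrated with a small host-side softmax over one row. The sketch assumes the convention y = alpha * softmax(x) + beta * y_prior, which matches the alpha/beta wording in the log but is not confirmed by it. The function name is hypothetical, and the out-of-bounds NaN trick from the commit is not reproduced here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the CK interface):
//   y[i] = alpha * softmax(x)[i] + beta * y[i]
// where y holds the prior values on entry. Does nothing if x is empty or the
// sizes disagree.
inline void softmax_row(const std::vector<float>& x, std::vector<float>& y, float alpha, float beta)
{
    if(x.empty() || x.size() != y.size())
        return;

    // Subtract the row maximum before exponentiating for numerical stability.
    const float x_max = *std::max_element(x.begin(), x.end());

    std::vector<float> e(x.size());
    float sum = 0.0f;
    for(std::size_t i = 0; i < x.size(); ++i)
    {
        e[i] = std::exp(x[i] - x_max);
        sum += e[i];
    }

    for(std::size_t i = 0; i < x.size(); ++i)
        y[i] = alpha * (e[i] / sum) + beta * y[i];
}
```

With `alpha = 1` and `beta = 0` this reduces to a plain softmax; a nonzero `beta` blends in whatever the output buffer already contained.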

09 Mar, 2022 1 commit

Reorganize files, Part 1 (#119) · 5d37d7bf

Chao Liu authored Mar 08, 2022

* delete obsolete files

* move files

* build

* update cmake

* update cmake

* fix build

* reorg examples

* update cmake for example and test

5d37d7bf

19 Aug, 2021 1 commit

Composable kernel init integration v3 (#1097) · 6fe3627a

Chao Liu authored Aug 19, 2021

* Squashed 'src/composable_kernel/' content from commit f6edda61

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda61

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda61..5781adf5

5781adf5 Update develop (#5) (#6)
97e6d514 Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e refactor
49c33aae refactor
54b3e73d rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5



* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert changes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make inner product compatible on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp
Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix
Co-authored-by: JD <Jehandad.Khan@amd.com>

6fe3627a

16 Aug, 2021 2 commits
- refactor · 67ad47e7
  Chao Liu authored Aug 16, 2021
  
  67ad47e7
- refactor · 16effa76
  Chao Liu authored Aug 16, 2021
  
  16effa76
27 Jul, 2021 1 commit

[MIOpen Downstream] Initial MIOpen integration (#52) · f63a23ac

Chao Liu authored Jul 27, 2021

* update online kernel wrapper bundle all descriptors in a tuple

* change __CONSTANT__ to CONSTANT

* rename

* adding tuning

* added IsValidCompileParameter

* reorganize

* adding tunable for fp16 and int8

* fix kernel compile warning and bug fixes

* suppress warning about cast CONSTANT (address space 4) pointer

* fix building issue

f63a23ac

01 Jul, 2021 1 commit

xdlops_v4r4_fwd fp32/fp16 (#34) · 3835318c

zjing14 authored Jul 01, 2021



* create files for xdlops

* working on blockwise_gemm_xdlops

* add KReduction

* add m/n repeats

* add 2x2 pipeline

* added 128x128 wavegemm

* use StaticBuffer of vector_type

* break vector type to blk_size

* add kpack into xdlops_gemm and blockwise_gemm

* abroadcast only

* add fp32 mfma instructions

* adding fp16 mfma

* pack half4_t

* rename kperwave to kpack

* add 32x32x8fp16

* add fp16 mfma

* clean code

* clean code

* V4r4 xdlops kpack (#35)

* add kpack with incorrect results

* bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2

* add 1x1 kernel

* add gridwise_gemm_v2 - single_buffer

* enabled dwordx4 for fp16
Co-authored-by: Chao Liu <chao.liu2@amd.com>

* refactor fwd-v4r4-xdlops

* add v4r4-nhwc-xdlop

* improve some perf of nhwc and nchw by tuning parameters, and change scheduling in gridwise-gemm loop

* tweak scheduling in gridwise gemm

* add v4r3 with a single output copy

* init commit: output with slice win

* adding sliceWin

* add multiple repeats pattern

* starting adding bwd-v4r1-xdlops

* use tuple as SrcBuffer

* adding bwd-data v4r1 nhwc xdlops

* fix bug in make_dynamic_naive_tensor_descriptor_aligned_v2()

* fix bug in host bwd-data conv

* initial implementation of bwd-data v4r1 nhwc xdlops

* add launch bound flags

* enable launch bound

* add m/nrepeat=4

* tweak bwd-data v4r1 nhwc xdlops

* added bwd-data v4r1 nhwc xdlops with output A and weight B

* add fwd-v4r4 nhwc xdlops, A input, B weight, C output
Co-authored-by: Chao Liu <chao.liu2@amd.com>

3835318c

10 Jun, 2021 1 commit

Restructure gridwise and blockwise GEMM, add tensor contraction and FWD-v4r5 (#36) · 30072aec

Chao Liu authored Jun 09, 2021

* experimenting magic number division

* overhauling fwd-v4r4 to clearly reflect transformation graph

* added fwd-v4r5

* bug fix for make_dynamic_naive_tensor_descriptor_aligned_v2

* bug fix and added sanity-check in transform_dynamic_tensor_descriptor

* added conv_driver_v2

30072aec
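
"Magic number division" in the commit above refers to replacing an integer division by a runtime-constant divisor with a multiply, an add, and a shift. The sketch below shows one standard formulation for 32-bit unsigned values. It assumes the dividend is below 2^31 and the divisor is at least 1. The struct and function names are hypothetical, and the experiment in the commit may differ in detail.

```cpp
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor d (hypothetical type).
struct MagicDivisor
{
    std::uint32_t multiplier;
    std::uint32_t shift;
};

// Precondition: 1 <= d <= 2^31.
inline MagicDivisor make_magic_divisor(std::uint32_t d)
{
    // Smallest shift such that 2^shift >= d.
    std::uint32_t shift = 0;
    while((std::uint64_t{1} << shift) < d)
        ++shift;

    const std::uint64_t pow2 = std::uint64_t{1} << shift;
    MagicDivisor m;
    m.multiplier = static_cast<std::uint32_t>(((pow2 - d) << 32) / d + 1);
    m.shift      = shift;
    return m;
}

// Precondition: n < 2^31, so that (high word + n) cannot overflow 32 bits.
inline std::uint32_t magic_divide(std::uint32_t n, const MagicDivisor& m)
{
    // High 32 bits of the 64-bit product n * multiplier.
    const std::uint32_t hi =
        static_cast<std::uint32_t>((static_cast<std::uint64_t>(n) * m.multiplier) >> 32);
    return (hi + n) >> m.shift;
}
```

The constants are computed once per divisor on the host, so each division inside a kernel costs only a multiply-high, an add, and a shift.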

11 May, 2021 1 commit

No raw index calculation (#31) · 01055d95

Chao Liu authored May 11, 2021



* Replace most raw index calculation with coordinate transformation
* Overhaul blockwise and threadwise GEMM
* Overhaul driver for gridwise GEMM kernel
Co-authored-by: Jing Zhang <jizhan@amd.com>

01055d95

25 Mar, 2021 1 commit

Dynamic tensor descriptor (#24) · fcbb9788

Chao Liu authored Mar 25, 2021



* support dynamic tensor descriptor

* use buffer load OOB feature for padding case

* add navi support

* add int8x4 inference kernel
Co-authored-by: Chao Liu <chao@ixt-rack-81.local.lan>
Co-authored-by: Jing Zhang <jizhan@amd.com>

fcbb9788

24 Jun, 2020 1 commit

Code clean up (#20) · 5c7cec11

Chao Liu authored Jun 23, 2020



* tuning parameters

* testing on v100

* add fp16

* remove deprecated tensor descriptor

* sync with miopen

* update build script
Co-authored-by: Jing Zhang <jizhan@amd.com>

5c7cec11

17 Feb, 2020 1 commit
- MIopen integration (#13) · 1a66e35b
  Chao Liu authored Feb 17, 2020
  * update for miopen integration: cosmetic refactor
  1a66e35b
27 Jan, 2020 1 commit
- Update for recent MIOpen integration (#11) · 3406a114
  Chao Liu authored Jan 27, 2020
  * update for MIOpen integration
  3406a114
20 Jan, 2020 1 commit

Added bwd data v3r1 v4r1, tweaking v1 (#10) · c5da0377

Chao Liu authored Jan 20, 2020

* Added bwd data v3r1: breaking down compute into a series of load balanced GEMM, and launch in a single kernel
* Added bwd data v4r1: like v3r1, but launch GEMMs in multiple kernels
* Tweaked v1r1  and v1r2 (atomic) on AMD GPU

c5da0377

03 Dec, 2019 1 commit

backward data (#7) · 8f5f6496

Chao Liu authored Dec 03, 2019

* enabled atomic add in tensor copy
* added gridwise GEMM
* added backward data conv using GEMM + atomic
* added backward data conv using GEMM, no atomic

8f5f6496

10 Sep, 2019 1 commit
- adding merge transform · ca42e910
  Chao Liu authored Sep 10, 2019
  
  ca42e910
09 Sep, 2019 1 commit
- more utility code · 7a7fe160
  Chao Liu authored Sep 09, 2019
  
  7a7fe160
05 Sep, 2019 1 commit
- adding dimension transformation · 0c05f427
  Chao Liu authored Sep 05, 2019
  
  0c05f427
20 Jun, 2019 1 commit
- refactor · 37b82b7e
  Chao Liu authored Jun 19, 2019
  
  37b82b7e
19 Jun, 2019 2 commits
- fixed amd build · 1f2cfceb
  Chao Liu authored Jun 19, 2019
  
  1f2cfceb
- refactor · 21f7e9f1
  Chao Liu authored Jun 19, 2019
  
  21f7e9f1
18 Jun, 2019 2 commits
- refactor · 9de63930
  Chao Liu authored Jun 18, 2019
  
  9de63930
- clean up for miopen · 23f633cd
  Chao Liu authored Jun 17, 2019
  
  23f633cd
17 Jun, 2019 2 commits
- refactoring · 9d59a39a
  Chao Liu authored Jun 17, 2019
  
  9d59a39a
- refactoring for miopen · 33d1e0e2
  Chao Liu authored Jun 17, 2019
  
  33d1e0e2
13 Jun, 2019 1 commit
- reorganized files · 1566b317
  Chao Liu authored Jun 13, 2019
  
  1566b317
12 Jun, 2019 1 commit
- reorganize files · 81497a93
  Chao Liu authored Jun 11, 2019
  
  81497a93
11 Jun, 2019 2 commits
- rename files, added header guard, added namespace · 88b77181
  Chao Liu authored Jun 11, 2019
  
  88b77181
- remove .hip extension · 05e04665
  Chao Liu authored Jun 11, 2019
  
  05e04665
07 Jun, 2019 1 commit
- use more constexpr for Array · 0a386c46
  Chao Liu authored Jun 06, 2019
  
  0a386c46
06 Jun, 2019 1 commit
- refactor · 7a89684f
  Chao Liu authored Jun 06, 2019
  
  7a89684f
05 Jun, 2019 1 commit
- use more constexpr · 709f13a6
  Chao Liu authored Jun 04, 2019
  
  709f13a6
03 Jun, 2019 1 commit
- use vectorized read and write for threadwise generic tensor copy · 917d7a2b
  Chao Liu authored Jun 03, 2019
  
  917d7a2b
30 May, 2019 1 commit
- adding implicit gemm v4 (nchw, kcyx) · b2439ec9
  Chao Liu authored May 30, 2019
  
  b2439ec9
23 May, 2019 1 commit
- adding implicit gemm v3 · 8a4b5978
  Chao Liu authored May 22, 2019
  
  8a4b5978