Commits · 9608beee990978d4086f49fc46c8f43ee60cc43c · gaoqiong / composable_kernel

04 Nov, 2022 1 commit
- removed most of the extraneous code, testing with different dimensions · d179a12a
  Astha Rai authored Nov 04, 2022
  
03 Nov, 2022 2 commits

Fused elementwise normalization (#492) · 8a4253ba

guangzlu authored Nov 04, 2022

* add fused addition layernorm

* add fused addition layernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete an annotation line

* fixed some bugs to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved intermediate results into lds, it becomes faster in non-sweeponce cases

* changed naming of C into X to make the definition clearer

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confusing comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp


integrated variable for thread distribution into device elementwise and added... · b4abe4e2
Astha Rai authored Nov 03, 2022
```
integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise
```

02 Nov, 2022 6 commits

Refine layernorm naming and test code (#497) · d4d1147f

rocking5566 authored Nov 03, 2022

* Sync the naming

* Sync the test of layernorm with groupnorm

* Sync the naming

* Minor change for comment and log

* [What] Add saveMean and SaveInvVariance in the interface.
[Why] These can optimize the backward


remove atten kernel workarounds as we move over to rocm 5.3 (#496) · 451f1e3d
Anthony Chang authored Nov 03, 2022


Add client example of grouped conv2d backward data (data type: fp16) (#481) · 9e57a290

Po Yen Chen authored Nov 03, 2022

* Improve example reusability

* Remove no-longer used file

* Rename folder of grouped_conv_bwd_data example

* Add normal grouped conv bwd example

* Add interface 'DeviceGroupedConvBwdData'

* Prettify comment of device op type arguments

* Add grouped conv2d/conv3d backward data fp16 instances

* Fix wrong template argument

* Add grouped_conv2d_bwd_data client example

* Use simpler expression to calculate memory size

* Fix formatting

* Remove grouped_conv3d_bw_data instances

Underlying device operator is not ready to handle 3D input

* Remove no-longer necessary include directive

* Add missing include directive

* Use more realistic conv param in example


Add pipeline v1/v2 selector, add more instances (#381) · 1a0b0e7b

Rostyslav Geyyer authored Nov 02, 2022



* Add gridwise gemm pipeline v1/v2 selector

* Pipeline selector working, test-wise add pipeline options to one instance

* Add gemm instances

* Add debug info to DeviceGemmXdl

* Add debug info to DeviceGemmXdl_CShuffle

* Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu

* Minor fix

* Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm

* set up inter-wave configuration

* use default loop scheduling for supported gemm ops

To blanket-apply inter-wave scheduling to all supported gemm ops, define the macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. This is discouraged, though, as it is not covered by CI.

* Add enum PipelineVersion

* Update instances

* Format

* Fix the merge conflict

* Add flags to disable added instances

* Test disable flag check

* Disable flag check

* Enable the instances
Co-authored-by: Anthony Chang <ac.chang@outlook.com>


Softmax unit-test reduction across all and non innermost dims cases. (#406) · 6d8614ee

Adam Osewski authored Nov 02, 2022



* Add reduction across all dims cases.

* host softmax: handle all reduce

* Test cases when reduced dim is not innermost axis.

* Fix syntax.

* Test non innermost dim for fp32 and int8

* Group test suites wrt NumReduceDim.

* Additionally test failing cases.

* Throw error when Rank or NumReduceDims doesn't match arguments.

* Check reducedDims has correct values

* Don't reuse the DeviceReduceMultiblock IsSupportedArgument method; implement our own instead
(in fact, this just drops one check to enable reduction across inner dimensions).

* Reorganize unit tests to better cover use scenarios.

* Test input validation
* Test reduction of inner dimensions with custom op instances.

* Refactor fp32 and int8 unit tests.

* Fix FP32 instance template parameters.

* Add more instances.

* Instances with InSrcVectorDim=0.

* Do not initialize and copy data when arg not supported.

* ckProfiler Softmax use instance factory.

* Refactor device softmax IsSupported.

* Additionally add non-polymorphic api functions

* Split softmax instances into multiple files.

* Fix profiler.

* Reorganize tests to reuse profiler and cover edge cases.

* Clang-format

* I8 Softmax instances along with UT.

* Reuse type alias definitions from instance factory header.

* Clean included headers

* Fix variable names.

* Add missing checks in Argument constructor.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>


Conv perlayer int8 quantization (#471) · 226bc02b

rocking5566 authored Nov 03, 2022

* Add conv2d requant example

* Fix bash error

* Rename example

* 1. Rename gemm quantization
2. shares the requantization lambda function with conv

* Refine declare type

* Add conv bias relu quantization example

* clang format

* Fix compile error due to merge develop

* Fix CI error

* Extract quantization post operation into another file

* Support quantization for non piecewise linear function

* Add instance for conv quantization

* Add convolution quantization factory

* Add convolution quantization client example

* Add more instances with different template parameters

* clang format

* Sync the naming with the develop


01 Nov, 2022 1 commit
- added variables to distribute threads through both dimensions · 10947a54
  Astha Rai authored Nov 01, 2022
  
31 Oct, 2022 1 commit

Add Conv Forward on Navi21 for ResNet50 (#490) · 8ee36118

ltqin authored Nov 01, 2022



* add device of dl

* fix k1 of GridwiseGemmDl_km_kn_mn_v1r3

* init version for dl conv

* add example(init)

* result right

* disable elementwise operation

* check parameters

* add fp32,int8 example and change check code

* change device file and class name

* add check vector access of C

* add instance

* add to ckProfiler

* add Filter1x1Pad0 instances

* fix ignore error

* fix for CI
Co-authored-by: letaoqin <letaoqin@amd.com>


28 Oct, 2022 1 commit

Batchnorm-forward implemented using welford method to calculate variance (#403) · 7fa892e6

Qianfeng authored Oct 28, 2022



* Update to the batchnorm-forward API and base class

* Fix leaked header include in gridwise_set_buffer_value.hpp

* Add kernels and device file for batchnorm-forward welford supporting both blockwise and multi-block reduction

* Update to the batchnorm-forward example to use the new batchnorm-forward device interface

* Change the batchnorm-forward reference to use sequential welford method

* Change to assign the workspace into four buffers in the host layer

* Use GetReduceCountPerThread functor to replace the initial count for Blockwise and Multiblock welford

* Tiny correction and remove un-used file under example/34_batchnorm

* Renaming in the kernel arguments

* Explicitly use ck::math::sqrt in batchnorm-forward kernels

* Add some comments to some kernels

* Tiny fix

* Generalize the data types in reference_batchnorm_forward_nhwc_c

* Use ck::ignore to mark un-used parameters

* Move GetReduceCountPerThread functor codes from kernel to device

* Remove some un-used codes in device_batchnorm_forward_impl.hpp

* Tiny fix in batchnorm_forward example

* Move GetReduceCountPerThread() to welford_helper.hpp

* Use separate data type for Scale and Bias

* Renaming in device Op

* Tiny fix in forward example

* Update to batchnorm-infer (type splitting, renaming)

* Add time and bandwidth measurement to the batchnorm-forward example

* Add support of elementwise operation for batchnorm forward output

* Reduce object copying by passing object as reference type

* Tiny change for performance

* Updates for performance again

* Some Renamings

* Add GetActualVariance template parameter for ThreadwiseWelfordMerge

* Tiny update in reference batchnorm forward nhwc/c

* Move batchnorm multiblock kernel files to grid/batchnorm_multiblock sub-directory

* Fuse mean and bias in the normalization calculation
Co-authored-by: root <root@dc-smc-18.amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>


27 Oct, 2022 1 commit

Input/output permutation for fused attention (#460) · de37550f

Anthony Chang authored Oct 28, 2022



* reopen masking att instance due to CI is upgraded

* re-enable instances previously failed on 9110

* enable ksize-kpadding pair validity test

* add non-masked attention+permute test; expose masking boolean to attention kernel handles

* disable bench

* fix test

* move files

* bulk rename batched_gemm_masking_scale_softmax_gemm_permute to batched_gemm_softmax_gemm_permute

* format

* amend rename

* disable bench in test

* add mask/no-mask test for non-permute attention kernels

* disable broken kernel instance

* example working

add non-permuted problem statement

evaluating whether overhead comes from permutation or the extra kernel arg

* interface for bias addition without implementing it

* test and profiler running

* tidy

* mask type determined by enum class

* unify example code

* move masking specialization to its own header

* align formats

* extract helper functions

* experiment merging dims for attn w/ permute; shows perf parity with attn wo/ permute

* add tensor specialization to template args

since tensor spec packed shows perf parity when permutation isn't needed

remove redundant template args

comment on 'packed' tensor specialization

* grouped attention with input/output permute example

* format

* clean up

* refactor acc0 tile visitor
Co-authored-by: shaojiewang <wsjmessi@163.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>


26 Oct, 2022 1 commit
- testing change · 33733eee
  Astha Rai authored Oct 26, 2022
  
25 Oct, 2022 3 commits

Update to the Reduction API and instances (#476) · dda3a0a1

Qianfeng authored Oct 25, 2022

* Simplify the macros for declaring and defining the add_device_reduce_instance_xxxx() instances

* Change the types of lengths and strides from std::vector to std::array for the reduction device interfaces

* Remove DeviceSoftmaxImpl's dependency on DeviceReduceMultiblock

* Split the cpp and hpp files for reduction instances to enable more parallel compiling

* Remove the use of macros for declaring reduction instances and instance references

* Update to add_device_reduce_instance_xxxx templated functions

* Use ReduceOperation+InElementwiseOp+AccElementwiseOp to replace the ReduceOpId in defining add_reduce_instance_xxxx() templates

* Change return format


Revert "Fused elementwise layernorm (#468)" (#491) · 6ea9257e
guangzlu authored Oct 25, 2022
```
This reverts commit efbcc6ed.
```

Fused elementwise layernorm (#468) · efbcc6ed

guangzlu authored Oct 25, 2022

* add fused addition layernorm

* add fused addition layernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete an annotation line

* fixed some bugs to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved intermediate results into lds, it becomes faster in non-sweeponce cases

* changed naming of C into X to make the definition clearer

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confusing comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp


14 Oct, 2022 1 commit
- fixed · 7e44fd84
  Jing Zhang authored Oct 14, 2022
  
13 Oct, 2022 2 commits

Refactor device op implementations into `impl` subdirectory. (#420) · 30480288

Adam Osewski authored Oct 13, 2022



* Move kernel implementation files under impl directory.

* Update examples paths.

* Update device kernel impl include paths.

* Update tensor operation instances include paths.

* Update profiler and tests include paths.

* Clang-format

* Update include paths for batched gemm reduce

* Refactor UnitTest ConvNDBwdWeight.

* Refactor fwd and bwd data convND UT.

* Fix used test macro.

* Fix include path.

* Fix include paths.

* Fix include paths in profiler and tests.

* Fix include paths.
Co-authored-by: Adam Osewski <aosewski@amd.com>


Fix bug of layernorm ckProfiler and refine code (#448) · 1b62bfaa

rocking5566 authored Oct 13, 2022

* Fix bug of profiler for layernorm

* 1. Rename layernorm into normalization
2. Decouple softmax from normalization

* clang-format


12 Oct, 2022 2 commits
- fixed isSupportedArgument · 3a9e6db3
  Astha Rai authored Oct 12, 2022
  
- changed indexing + do/while · c2487eaa
  Astha Rai authored Oct 12, 2022
  
11 Oct, 2022 3 commits

Example contraction splitk (#430) · d8b41e1c

ltqin authored Oct 12, 2022

* start split k

* add base device class

* add example after merge develop

* add gridwise gemm

* add b matrix split k

* split=1

* change name for kb

* not bias result right

* bias only add once

* fix register spill

* regular code

* add fp32 example

* fix for 64bit index

* fix CheckValidity of gridwise


changed isSupportedArgument for 2D · e21c1785
Astha Rai authored Oct 11, 2022

altered indexing · 64026bc3
Astha Rai authored Oct 11, 2022


07 Oct, 2022 1 commit

Optimization for gridwise group norm (#453) · 40942b90

Shaojie WANG authored Oct 07, 2022



* use another instance to check the efficiency

* optimize group layer norm

* 1. coalesce load/store data for gridwise layer norm welford. 2. move a sqrt and division into an outer static loop

* add more instances to layernorm

* add 2 more test cases

* remove ignore in generating tuple of vector
Co-authored-by: Chao Liu <chao.liu2@amd.com>


06 Oct, 2022 1 commit
- removed extra code · 194bf17e
  Astha Rai authored Oct 06, 2022
  
05 Oct, 2022 1 commit
- commented out unused code · 41bcd608
  Astha Rai authored Oct 05, 2022
  
04 Oct, 2022 2 commits
- added dimensions for example file · be56fdef
  Astha Rai authored Oct 04, 2022
  
- fixed 2d thread indexing · 08848bb6
  Astha Rai authored Oct 04, 2022
  
28 Sep, 2022 2 commits
- updated Grid Desc · 1d97c3a4
  Astha Rai authored Sep 28, 2022
  
- changed blockID to 2D · facdb52e
  Astha Rai authored Sep 28, 2022
  
27 Sep, 2022 1 commit
- fixed NumDim dimension error · 76b44c60
  Astha Rai authored Sep 27, 2022
  
26 Sep, 2022 3 commits
- fixed indexing for loop step · 4dfcf974
  Astha Rai authored Sep 26, 2022
  
- fixed compiler issues · 88d5d8d0
  Astha Rai authored Sep 26, 2022
  
- changed NumDim into 2D · 085d9d11
  Astha Rai authored Sep 26, 2022
  
25 Sep, 2022 2 commits
- added 2d version of device elementwise · 9e07a42f
  Astha Rai authored Sep 25, 2022
  
- added 2d gridwise elementwise · ad0470b5
  Astha Rai authored Sep 25, 2022
  
22 Sep, 2022 1 commit
- fix build (#434) · e9d4e893
  Chao Liu authored Sep 22, 2022
```
* fix

* fix

* add instance
```
21 Sep, 2022 1 commit
- fixed G offset calc for long_index (#428) · 01876afa
  zjing14 authored Sep 21, 2022
  