Commits · d66421fe34f2b69de7fe53876a7eb5dea4f3fd9f · gaoqiong / composable_kernel

"src/include/threadwise_2d_tensor_op.hip.hpp" did not exist on "c8d0356a34c835776f1e9e08c06b99b8c58d1e0a"

18 Jan, 2023 2 commits

Add multiD Gemm client APIs (#534) · d66421fe

ltqin authored Jan 19, 2023



* start add example

* fix config

* fix showinfo bug

* add an elementop

* change to padding

* add xdl example

* change elementwiseop

* add instance

* add instance to profiler

* change file name

* fix deive not support issue

* add client example

* fix client gemm_add_multiply name

* change AddMultiply elementwiseop

* fix elementwiseop

* fix client example

* fix addmultiply op

* fix comments and fun name
Co-authored-by: letaoqin <letaoqin@amd.com>

d66421fe

add multi embeddings support (#542) · 147b7db5

who who who authored Jan 19, 2023

* add multi embeddings support

* fix format

* optimize sqrt

* add reduce operation

* change to elementwise op

* fix name

* rename

* run ci cd

* format example

* format code

* format code

147b7db5

17 Jan, 2023 3 commits

Reduction external API and client examples (#493) · 80e05267

Qianfeng authored Jan 17, 2023



* Change to the DeviceReduce base class template to include all problem description information

* Add external api for reduction

* Add client example to test the reduction external api

* Spelling correction

* Re-implement the host_reduction to follow the DeviceReduce base API format

* Change the reduce profiler to call the external API for collecting device instances

* Rename reduce client example directory from 08_reduce to 12_reduce

* Remove (void) before the functional call

* Tiny update in reduce client example

* Tiny update in profile_reduce_impl.hpp

* Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

80e05267

Gemm layernorm welford (#413) · 7829d729

rocking5566 authored Jan 17, 2023



* Add device op of gemm layernorm

* [What] Rename F to H
[Why] F and G prepare for welford tensor

* Add gridwise gemm + welford

* Extract template parameter

* Rename kernel. Prepare to add second half kernel

* Extract var

* Add second kernel for gemm+layernorm

* Move to the gemm_layernorm folder

* Rename F and G to mean and var

* Do not use snakeCurved, it makes determination of padding  for welford difficult

* Rewrite the device interface and rename some var

* Add welford count

* Update interface

* Sync code, prepare to test on MI200

* Clean the code

* Implement layernorm

* Add comment to mension hipFree

* Wrtie out the e for debug.
This could be remove and use h for instead

* 1. Allocate mean, var and count into by SetWorkSpacePointer.
2. Add GetWorkSpaceSize to calculate the space size

* Add gemm layernorm host code

* use reference layernorm

* Fix bug of blockwise welford for first kernel

* Fix bug of mean var padding for layernorm

* Use sgpr for shuffleM_index

* padding for GemmMeanVarCountGridDescriptor_M_NBlock

* Add layout parameter

* Check argument for gemm

* calculate max count for tail block

* Share E and H memory in device op

* Hard code the vector dim

* Refine the MakeDescriptor

* 1. Remove E parameter, because E is inside of device op
2. Check vector size

* [What] Rename MakeMeanVarDescriptor_M_N
[Why] Prepare to add count version of make descriptor

* Use 1D global memory for count

* Prevent redundant IO

* Update parameter

* Add pipeline v1/v2 selector

* Rename the example name

* Add base class for gemm layernorm

* Refine naming to distinguish naive and welford

* Add comment to explan in detail

* We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile 1

* Rewrite the 2st kernel, use multiple block along N dimension in layernorm kernel

* Share the vector size

* Refine var name

* [What] Force LayernormThreadSliceSize_N = vector size.
[Why] Memory coalesce

* Add comment

* Extract divisor out of the loop in reference layernorm

* Pad different size for E and H in layernorm kernel according to different block tile

* Refine naming

* Refine naming

* Prevent implicit cast

* [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf
[Why] __builtin_amdgcn_sqrtf is only support float, double will cause casting

* Cast only constant

* Change of post shuffle thread descriptor

* Add EMeanVarDataType parameter.

* Merge the mean and var threadwise copy

* Add missing index

* Fix Typo

* Sync the variable with previous if

* 1. Declare e inside the host_gemm_layernorm()
2. Prevent implicit cast in reference code
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

7829d729

[Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541) · 919aeb1f

Haocong WANG authored Jan 17, 2023

* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* tempsave

* tempsave

* tempsave

* runtime bug, cannot find symbol

* workaround for incorrect HIP warpSize return value

* debugging

* tempsave

* Correctness OK, waiting for optimization

* Tidy up + format

* temp save

* temp save, reproduce the v_bfi_b32 issue

* add inline asm for wmmaop test

* tidy up

* clean some debug purpose code

* discard some codes

* clang format

* clang format

* compiler issue fixed + increase tile size

919aeb1f

12 Jan, 2023 2 commits

Add a flag to enable/disable debug output in many kernels. (#549) · 715e8dd2

Illia Silin authored Jan 11, 2023

* add DEBUG_LOG macro to enable/disable debug output

* fix syntax

* fix syntax again

* fix syntax one more time

* remove balnk spaces

* use ifdefs

* add the Print argument

* move the definition of DEBUG_LOG to ck.hpp

* add the missign argument to Print()

715e8dd2

Remove including of cmath (#551) · a17b0414

Qianfeng authored Jan 12, 2023

* Let cmath included when compiling host codes in math_v2.hpp

* Remove including of cmath in device_base.hpp and device_permute.hpp

a17b0414

15 Dec, 2022 2 commits
- Add MNK padding, M = 0 support into grouped_gemm (#539) · 0345963e
  zjing14 authored Dec 15, 2022
```
* add mnk padding, support m=0

* clean code

* clean code
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
```
  0345963e
- Add interface GetTypeIdName() and GetTypeIdHashCode() for Device Op (#533) · 10c72ace
  Qianfeng authored Dec 15, 2022
  
  10c72ace
12 Dec, 2022 1 commit

Gridwise elementwise 2d (#466) · 0e5c264c

arai713 authored Dec 12, 2022



* added 2d gridwise elementwise

* added 2d version of device elementwise

* added example file with updated device elementwise call

* added Cmake file

* changed NumDim into 2D

* fixed compiler issues

* fixed indexing for loop step

* fixed NumDim dimension error

* changed blockID to 2D

* updated Grid Desc

* updated kernel call

* fixed 2d thread indexing

* added dimensions for example file

* commented out unused code

* changed vector load

* removed extra code

* temporarily removing vector load on 2nd dim

* changed vector load back, still causing errors

* altered indexing

* changed isSupportedArgument for 2D

* changed indexing + do/while

* fixed isSupportedArgument

* changed dimension for debugging

* fixed

* added testing printouts

* testing change

* added variables to distribute threads through both dimensions

* testing changes

* integrated variable for thread distribution into device elementwise and added as parameter for gridwise elementwise

* removed most of the extraneous code, testing with different dimensions

* testing

* removed debugging print statements

* moved 2d elementwise permute into elementwise permute directory

* fixed formatting

* removed debugging comments from threadwise transfer
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

0e5c264c

08 Dec, 2022 1 commit

Make sure that GEMM sizes in K dimension are supported. (#527) · d58b7f51

Illia Silin authored Dec 08, 2022

* apply new K-dimension check in gemm_xdl_cshuffle

* add K-dim check to gemm_xdl and batched_gemm_xdl

* fix syntax

* fix syntax

* clean-up the debug output

d58b7f51

07 Dec, 2022 1 commit
- modified half function in math_v2.hpp (#528) · ce87b4f7
  guangzlu authored Dec 08, 2022
```
Co-authored-by: Chao Liu <chao.liu2@amd.com>
```
  ce87b4f7
02 Dec, 2022 3 commits

Fix bug where scaling may not be applied in some code path (#526) · d1567094
Anthony Chang authored Dec 03, 2022
```
* fix bug where scaling may not be applied in some code path

* more test

* revert accidental example code changes
```
d1567094

Add multiple d gridwise gemm on Navi21 for ResNet50 (#517) · 23ecf0fa

ltqin authored Dec 03, 2022



* start add example

* add multiple d fp16 example

* device transfer elementwiseop to gridwise

* gridwise add multiple d

* change example for multiple d

* fix spill registers

* fix for passthrough element op

* fix int8 overflow

* change example file name

* add instance for dl multiple d

* example add DsDataType

* remove grouped_convolution_forward_dl.hpp

* add head file(was deleted before)

* fix not support device issue

* format

* remove passthrough check
Co-authored-by: letaoqin <letaoqin@amd.com>

23ecf0fa

[Navi3x-LWPCK-449] wmma_op + unit test (#484) · abf9cc6c

Haocong WANG authored Dec 03, 2022



* wmma_op + unit test

* add arch limitation to wmma test

* change arch limitation

* Refactor + Add all type unit test(int4 compile failed)

* Add f32_16x16x16_bf16 unit test

* Remote int4 related

* delete deprecated test
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

abf9cc6c

30 Nov, 2022 2 commits

gemm, conv perchannel quantization (#503) · ad541ad6

rocking5566 authored Dec 01, 2022

* Use gemm_multiple_D instead

* Add gemm bias relu quantization example

* Add pure gemm quantization example

* Add quantization of perchannel conv + bias + relu example

* Refine the code

* Rename multiplier to requant_scale

* Rename the folder

* Remove redundant comment

* Rename the file. Prepare to add perchannel

* Add conv perchannel instance

* Move to quantization folder

* Add conv perchannel client example

* Apply Rangify constructor of HostTensorDescriptor & Tensor<>

* Fix merge error

ad541ad6

BatchNorm backward instance/external API/profiler/tests (#519) · 63af525c

Qianfeng authored Dec 01, 2022

* Refine the device batchnorm-backward base API templates and data type assignments

* Remove duplicated kernel file

* Add batchnorm backward instances and external API

* Add batchnorm-backward profiler and tests

* Add client example which uses batchnorm backward external API

* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory

* Loose the threshold for batchnorm-backward check_err()

63af525c

29 Nov, 2022 2 commits

fix GetTypeString · 0e9c88ce
fsx950223 authored Nov 16, 2022

0e9c88ce

BatchNorm backward implementation (#461) · 44789d99

Qianfeng authored Nov 29, 2022

* Implemented batchnorm-backward Blockwise and Multiblock kernels

* Add batchnorm-backward device op

* Add batchnorm-backward host-reference op

* Add batchnorm-backward example

* Parameters renaming in batchnorm backward kernels and device op

* Change in the example to loose the threshold for ScaleDiff checking

* Add comments to explain the implementation of batchnorm-backward

* Parameters renaming again in batchnorm backward kernels

* Improve the expression calculation for performance

* Add batchnorm backward to README

* Add comments to explain inv-variance in batchnorm forward and backward

* Renaming the batchnorm forward training and inferring examples

* Add/update the comments for batchnorm-backward kernels

* Renaming again

* Add block_sync_lds between two consecutive blockwise reductions

* Move common expression 1/N out of the static_for loops

* Add dy_elementwise_op

* Renaming in backward example again

* Add checking for reduceDims in reference_batchnorm_backward

* Update to comments and codes format

* Rename in the comments

* Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c

* Add block_sync_lds() between blockwise reduction again

* Fix comments again

* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test

44789d99

25 Nov, 2022 1 commit

BatchNorm forward instance/external api/profiler/tests/client example (#511) · 4e6a5575

Qianfeng authored Nov 25, 2022



* Update to device_batchnorm_forward base class to include all template parameters for problem description

* Add batchnorm forward instances and external api

* Add batchnorm forward profiler module which uses the external api

* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]

* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward

* Improvement to the batchnorm infer base API

* Add batchnorm forward client example which shows using the batchnorm forward external API

* Add test for batchnorm forward

* Tuning the batchnorm profiler initialized values and error threshold

* Add support for bhalf_t in instances/external api/tests

* Add support for int8_t in instances/external api/tests

* Add support for double in instances/external api/tests

* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances

* Checking before running best instance in batchnorm_fwd_nhwc client example

* Add checking for YElementwiseOp in batchnorm_forward external API

* Add more types in batchnorm forward profiler

* Add more test lengths
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>

4e6a5575

20 Nov, 2022 1 commit

Client examples AddFastGelu and FastGelu + instances. (#509) · 43a889b7

Adam Osewski authored Nov 20, 2022



* FastGelu support for more data types.

* AddFastGelu & FastGelu instances.

* Client example.

* clang-format

* Remove unused stride variable.

* Add new line at EOF.
Co-authored-by: Adam Osewski <aosewski@amd.com>

43a889b7

17 Nov, 2022 1 commit
- Work around develop validation failure (#513) · 892a8d76
  Anthony Chang authored Nov 18, 2022
```
* workaround bf16 atten fwd issue on gfx908

* typo
```
  892a8d76
15 Nov, 2022 4 commits

Add BF16 tests for batched_gemm_softmax_gemm_permute (#504) · 4c4c7328

guangzlu authored Nov 16, 2022



* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm

* added bf16 tests for batched_gemm_softmax_gemm_permute

* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp

* aligned annotations

* modified CMakeLists for examples

* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl

* use macro to control the instances

* added macro control into instances

* clang-format some files

* changed error tolerance for bf16

* changed index for 10_elementwise_normalization

* fixed xdlops code bug in amd_xdlops.hpp
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

4c4c7328

Add Conv Backward Data on Navi21 for ResNet50 (#499) · db0eb1ea

ltqin authored Nov 16, 2022



* start add example

* add device dl

* change launch kernel

* change init data method

* change example config

* add config valid check

* add instance for dl bwd

* add instance to ckProfiler

* reserver to profiler and cmakelist

* add instance to ckProfiler2

* change instance f32 config

* fix example return value
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

db0eb1ea

Avoid reporting unused member function error (#507) · 7038723a
Po Yen Chen authored Nov 15, 2022

7038723a

Introduce ck::accumulate_n() (#439) · 730204ee

Po Yen Chen authored Nov 15, 2022

We can use this template to eliminate duplicated iterator computing
logics. By providing return type to ck::accumulate_n(), we can avoid
type conversion operations.

730204ee

11 Nov, 2022 1 commit
- Fix build errors on CI server (#506) · 4382b414
  Po Yen Chen authored Nov 12, 2022
```
* Add missing ignore expression

* Add missing include directive
```
  4382b414
10 Nov, 2022 4 commits

add client example for elementwise_normalization (#501) · 70456328

guangzlu authored Nov 11, 2022

* add client example for elementwise_normalization

* clang format elementwise_layernorm2d.cpp

* changed some naming to make it more understandable

* changed naming of input into ab_input

* fixed bug for threadwise_x_store

* add elementwise operation to reference

70456328

Add client example of grouped conv2d forward (data type: fp16) (#488) · f4980310

Po Yen Chen authored Nov 10, 2022

* Rename example folder for GroupedConvFwdMultipleD

* Unify example codes

* Change target names

* Add fp16 example for multiple d instance

* Re-format common.hpp

* Add interface 'DeviceGroupedConvFwd'

* Use simpler interface

* Move common conv params out

* Rename conv fwd client example folder

* Add missing include directive

* Update grouped conv instance implementations

* Simplify ckProfiler (grouped conv forward)

* Use GroupedConvFwd to implement client example

* Use greater groupe count in example

* Add custom target to group examples

* Add extra tag param to instance factory function

* Use tag to differentiate factory functions

* Add missing tag argument for factory function

* Remove inheritance relationship

* Remove no-longer used include directive

* Add license in front of file

f4980310

Add client example of grouped conv2d backward weight (data type: fp16) (#498) · 38470e04

Po Yen Chen authored Nov 10, 2022

* Remove redundant CMake setting

* Extract common code from files

* Rename folder 'convnd' to 'conv'

* Use std::array<> to accept compile-time kwnown # of arguments

* Fix compilation error of tuning parameter

* In example, use same setting as unit-test

* Remove no-longer used include directive

* Add interface for grouped conv bwd weight

* Add group support for conv bwd weight

* Add grouped conv bwd weight example

* Use group parameter in example

* Rename example folder

* Remove non-grouped version example source files

* Rename device op template

* Add group support to convolution backward weight

* Remove debug messages

* Use smaller group size in example

* Use named variable as loop terminate condition

* Prettify example output message

* Enlarge used grid size

* Allow real grid size exceeds expected grid size

* Rename interface file

* Add client example for grouped conv2d bwd weight

* Fix wrong include directive

* Rename client example folder

38470e04

Remove interface 'DeviceGroupedConvBwdData' (#500) · 67423a22

Po Yen Chen authored Nov 10, 2022

* Remove interface 'DeviceGroupedConvBwdData'

* Remove no-longer needed include directive

* Rename client example folder

67423a22

03 Nov, 2022 1 commit

Fused elementwise normalization (#492) · 8a4253ba

guangzlu authored Nov 04, 2022

* add fused addition lyernorm

* add fused addition lyernorm

* changed CMakelist

* removed annotates

* modified descriptor of C

* fixed bug in gridwise add layernorm

* format the files

* modified name from add&layernorm into elementwise&layernorm

* created fused elementwise layernorm branch

* change input into tuple type

* add sweep once to reduce load & read of C from global memory

* modified Argument api

* modified way to malloc c in global memory

* changed gamma and beta to m_k_desc

* fixed bug when sweep once and move CDataType when define device level struct

* add src dim for gamma and beta

* implement optimization for coalesced

* delete a annotation line

* fixed some bug to meet the requirements of ck

* add bandwidth computing in example, and fixed the time unit

* move device_elementwise_layernorm_impl.hpp into device/impl

* fixed bug in device_elementwise_layernorm_impl.hpp

* changed name from layernorm into normalization

* clang-format the changed files

* changed the names

* moved immidiate results into lds, it become faster in non-sweeponce cases

* changed naming of C into X to make the defination more clear

* changed naming in example

* add tests for elementwise normalization

* move example_elementwise_layernorm_blockwise into folder 44_elementwise_normalization

* move test_elementwise_layernorm_fp16 into new folder

* move elementwise_normalization_instances into a new folder

* add more tests in test_elementwise_layernorm_fp16.cpp

* added some corner cases in test

* fixed method to compute lds size for matrix X

* changed name of 44_elementwise_normalization into 45_elementwise_normalization

* modified some comments

* modified some other confused comments

* reduce redundant tests in test_elementwise_layernorm_fp16.cpp

8a4253ba

02 Nov, 2022 6 commits

Refine layernorm naming and test code (#497) · d4d1147f

rocking5566 authored Nov 03, 2022

* Sync the naming

* Sync the test of layernorm with groupnorm

* Sync the naming

* Minor change for comment and log

* [What] Add saveMean and SaveInvVariance in the interface.
[Why] These can optimize the backward

d4d1147f

remove atten kernel workarounds as we move over to rocm 5.3 (#496) · 451f1e3d
Anthony Chang authored Nov 03, 2022

451f1e3d

Add client example of grouped conv2d backward data (data type: fp16) (#481) · 9e57a290

Po Yen Chen authored Nov 03, 2022

* Improve example reusability

* Remove no-longer used file

* Rename folder of grouped_conv_bwd_data example

* Add normal grouped conv bwd example

* Add interface 'DeviceGroupedConvBwdData'

* Prettify comment of device op type arguments

* Add grouped conv2d/conv3d backward data fp16 instances

* Fix wrong template argument

* Add grouped_conv2d_bwd_data client example

* Use simpler expression to calculate memory size

* Fix formating

* Remove grouped_conv3d_bw_data instances

Underlying device operator is not ready to handle 3D input

* Remove no-longer necessary include directive

* Add missing include directive

* Use more realistic conv param in example

9e57a290

Add pipeline v1/v2 selector, add more instances (#381) · 1a0b0e7b

Rostyslav Geyyer authored Nov 02, 2022



* Add gridwise gemm pipeline v1/v2 selector

* Pipeline selector working, test-wise add pipeline options to one instance

* Add gemm instances

* Add debug info to DeviceGemmXdl

* Add debug info to DeviceGemmXdl_CShuffle

* Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu

* Minor fix

* Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm

* set up inter-wave configuration

* use defualt loop scheduling for supported gemm ops

for blanket-applying interwave scheduling for all supported gemm ops, define macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. this should be discouraged though as it is not covered by CI

* Add enum PipelineVersion

* Update instances

* Format

* Fix the merge conflict

* Add flags to disable added instances

* Test disable flag check

* Disable flag check

* Enable the instances
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

1a0b0e7b

Softmax unit-test reduction across all and non innermost dims cases. (#406) · 6d8614ee

Adam Osewski authored Nov 02, 2022



* Add reduction across all dims cases.

* host softmax: handle all reduce

* Test cases when reduced dim is not innermost axis.

* Fix syntax.

* Test non innermost dim for fp32 and int8

* Group test suites wrt NumReduceDim.

* Additionally test failing cases.

* Throw error when Rank or NumReduceDims doesn't match arguments.

* Check reducedDims has correct values

* Move don't reuse DeviceReduceMultiblock IsSupportedArgument method.
Instead implement own. (in fact just get rid of one check to enable
reduction across inner dimensions).

* Reorganize unit tests to better cover use scenarios.

* Test input validation
* Test reduction of inner dimensions with custom op instances.

* Refactor fp32 and int8 unit tests.

* Fix FP32 instance template parameters.

* Add more instances.

* Instances with InSrcVectorDim=0.

* Do not initialize and copy data when arg not supported.

* ckProfiler Softmax use instance factory.

* Refactor device softmax IsSupported.

* Additionally add non-polymorphic api functions

* Split softmax instances into multiple files.

* Fix profiler.

* Reorganize tests to reuse profiler and cover edge cases.

* Clang-format

* I8 Softmax instances along with UT.

* Reuse type alias definitions from instance factory header.

* Clean included headers

* Fix variable names.

* Add missing checks in Argument constructor.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>

6d8614ee

Conv perlayer int8 quantization (#471) · 226bc02b

rocking5566 authored Nov 03, 2022

* Add conv2d requant example

* Fix bash error

* Rename example

* 1. Rename gemm quantization
2. shares the requantization lambda function with conv

* Refine declare type

* Add conv bias relu quantization exmaple

* clang format

* Fix compile error due to merge develop

* Fix CI error

* Extract quantization post operation into another file

* Support quantization for non piecewise linear function

* Add instance for conv quantization

* Add convolution quantization factory

* Add convolution quantization client example

* Add more instances with different template parameters

* clang format

* Sync the naming with the develop

226bc02b

31 Oct, 2022 1 commit

Add Conv Forward on Navi21 for ResNet50 (#490) · 8ee36118

ltqin authored Nov 01, 2022



* add device of dl

* fix k1 of GridwiseGemmDl_km_kn_mn_v1r3

* init version for dl conv

* add example(init)

* result right

* disable elementwise operation

* check parameters

* add fp32,int8 example and change check code

* change deive file and class name

* add check vector access of C

* add instance

* add to ckProfiler

* add Filter1x1Pad0 instances

* fix ignore error

* fix for CI
Co-authored-by: letaoqin <letaoqin@amd.com>

8ee36118

28 Oct, 2022 1 commit

Batchnorm-forward implemented using welford method to calculate variance (#403) · 7fa892e6

Qianfeng authored Oct 28, 2022



* Update to the batchnorm-forward API and base class

* Fix leeked header including in gridwise_set_buffer_value.hpp

* Add kernels and device file for batchnorm-forward welford supporting both blockwise and multi-block reduction

* Update to the batchnorm-forward example to use the new batchnorm-forward device interface

* Change the batchnorm-forward reference to use sequential welford method

* Change to assign the workspace into four buffers in the host layer

* Use GetReduceCountPerThread functor to replace the initial count for Blockwise and Multiblock welford

* Tiny correction and remove un-used file under example/34_batchnorm

* Renaming in the kernel arguments

* Explicitly use ck::math::sqrt in batchnorm-forward kernels

* Add some comments to some kernels

* Tiny fix

* Generalize the data types in reference_batchnorm_forward_nhwc_c

* Use ck::ignore to mark un-used parameters

* Move GetReduceCountPerThread functor codes from kernel to device

* Remove some un-used codes in device_batchnorm_forward_impl.hpp

* Tiny fix in batchnorm_forward example

* Move GetReduceCountPerThread() to welford_helper.hpp

* Use seperate data type for Scale and Bias

* Renaming in device Op

* Tiny fix in forward example

* Updata to batchnorm-infer (type spliting, renaming)

* Add time and bandwidth measurement to the batchnorm-forward example

* Add support of elementwise operation for batchnorm forward output

* Reduce object copying by passing object as reference type

* Tiny change for performance

* Updates for performance again

* Some Renamings

* Add GetActualVariance template parameter for ThreadwiseWelfordMerge

* Tiny update in reference batchnorm forward nhwc/c

* Move batchnorm multiblock kernel files to grid/batchnorm_multiblock sub-directory

* Fuse mean and bias in the normalization calculation
Co-authored-by: root <root@dc-smc-18.amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>

7fa892e6