Commits · d91f9f119c167ac5f3974e78f09bdd007f5dfd4a · yangql / composable_kernel-1

22 Mar, 2022 2 commits

Grouped GEMM for fp16 (#126) · 716f1c7f

zjing14 authored Mar 22, 2022

* init of grouped_gemm

* 2 gemm test

* perf test

* clean

* wrap desc into a struct

* test cast static_arr to pointer

* add ptr to GemmDesc

* add grouped gemm profiler

* fixed mem issue with unique_ptr

* clean

* clean

* finished ckprofiler

* Update README.md

* readme

* fixed readme

* add example

* improve code

* fixed comments: reserve, separate ptr and gemm_shapes

* merge group and non-group

* fixed comments: replace push_back with emplace_back to avoid copy constructor

* fixed comments: unified blk2ctile; add test

* ci fix

* fixed ci

* fixed ci

* fixed ci

716f1c7f
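
As a rough illustration of the idea behind this commit, a grouped GEMM runs several independent matrix multiplications, each with its own shape, in one call. The sketch below is plain Python on nested lists; `GemmShape`, `matmul`, and `grouped_gemm` are hypothetical names chosen for illustration and do not mirror the library's device API.

```python
from dataclasses import dataclass


@dataclass
class GemmShape:
    """Hypothetical per-problem shape: (M x K) times (K x N)."""
    m: int
    n: int
    k: int


def matmul(a, b, shape):
    """Reference row-major matrix multiply on nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(shape.k))
             for j in range(shape.n)]
            for i in range(shape.m)]


def grouped_gemm(shapes, a_list, b_list):
    """Run one GEMM per group; shapes may differ between groups."""
    return [matmul(a, b, shape)
            for shape, a, b in zip(shapes, a_list, b_list)]


# Two groups with different shapes, processed in a single call.
shapes = [GemmShape(2, 2, 3), GemmShape(1, 2, 2)]
a_list = [[[1, 2, 3], [4, 5, 6]], [[1, 1]]]
b_list = [[[1, 0], [0, 1], [1, 1]], [[2, 3], [4, 5]]]
c_list = grouped_gemm(shapes, a_list, b_list)
```

Keeping the shapes in their own list, apart from the data, loosely echoes the "separate ptr and gemm_shapes" note in the commit message, though the real layout may differ.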

Reduction for int8 and bfloat16 (#125) · 9a8ee8a3

Qianfeng authored Mar 23, 2022



* Use thread cluster descriptor and explicit M_K 2d descriptor to simplify Blockwise Reduction

* Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter

* Rename the folder name for the pool2d and reduce examples

* Update to reduction test scripts

* Add Readme for pool2d_fwd and reduce_blockwise examples

* Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX)

* Tiny fix in reduce profiler and tiny update in reduce testing scripts

* Tiny fix in testing script profile_reduce_no_index.sh

* Tiny fix in testing script profile_reduce_no_index.sh

* Add support for bfp16 reduction (using bhalf_t = ushort)

* Tiny fix in amd_buffer_addressing.hpp

* Tiny change in script/profile_reduce_with_index.sh

* Use AccDataType for Beta value and use element_wise::PassThrough

* Use type_convert for type converting in host layer reduction

* Renaming and refining in Reduction profiler/device layer/examples

* Renaming and refining in Reduction profiler/device layer/examples

* Renaming all NumReduceDims to NumReduceDim

* Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2

* Update to testing scripts to add bf16 support

* added more static_assert

* Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp

* Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations

* minor change

* Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass

* Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp

* Tiny fix in script/profile_reduce_no_index.sh

* Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims

* Generic renaming in host reduction and DeviceReduce layer

* Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances

* Use multi-thread and simplification for host Reduction implementation

* Add ctest for reduction

* Update to clarify the use of the data init method in produce_reduce/example_reduce/test_reduce/

* Update to the reduce CTest executables to enable default testing behavior when no command-line argument is given

* Renaming
Co-authored-by: Jianfeng yan <jfyan008@gmail.com>

9a8ee8a3
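
The reduction operations named in this commit (ADD, AVG, MIN, MAX, AMAX) can be sketched on an M x K layout, where M is the invariant dimension and K is the reduced one. This is a minimal host-side illustration in Python; `reduce_m_k` is a hypothetical helper, not a function from the repository.

```python
def reduce_m_k(x, op):
    """Reduce each row of an M x K nested list along K."""
    ops = {
        "ADD": lambda row: sum(row),
        "AVG": lambda row: sum(row) / len(row),
        "MIN": min,
        "MAX": max,
        # AMAX is the maximum of the absolute values.
        "AMAX": lambda row: max(abs(v) for v in row),
    }
    return [ops[op](row) for row in x]


x = [[1, -4, 2], [3, 0, -1]]
add_result = reduce_m_k(x, "ADD")
amax_result = reduce_m_k(x, "AMAX")
```

Python integers do not overflow, so the sketch hides one practical concern: summing many int8 values needs a wider accumulator, which is presumably why the commit refers to a separate AccDataType.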

21 Mar, 2022 3 commits

refactored deviceBatchedGemm; removed GridwiseBatchedGemm; added fp32 and int8 to profiler (#120) · cb87b049

Jianfeng Yan authored Mar 21, 2022

* changed long_index_t to index_t when computing memory offset

* uncomment other ops in profiler

* added test for batched_gemm

cb87b049

Gemm_c_shuffle (4 layouts) X (fp32 bf16 int8) (#131) · 485ea46a

rocking5566 authored Mar 22, 2022



* [What] Separate fixpoint gemm from gemm example
[Why] let example of gemm_int8 be pure gemm.
[What]
1. Add gemm_requant_relu_requant,
2. Let CDataType be int32 in pure gemm, because no one uses int8 CDataType. It is also part of gemm_requant_relu_requant

* Fix path

* Revise cmakelist due to merge develop

* Add gemm fp16 test

* Extract PrepareGemmTensor

* Extract TestGemm

* Add test for different layout

* Add 4 layouts of shuffle version of fp32

* Add 4 layouts of shuffle version of int8

* Add 4 layouts of shuffle version of bf16

* replace all DeviceGemmPtr_ with DeviceGemmNoOpPtr to fit naming convention

* Add test for non-shuffle version of gemm

* Fix typo

* Print kernel information

* Add rest of the fp32 kernel to the test

* 1. Add rest of the fp16 device op.
2. Mark the invalid device operation
Co-authored-by: rocking <chunylai@amd.com>

485ea46a

Fix conv2d bwd data bug when filter is 1x1 and stride = 2 (#132) · b51808d7

ltqin authored Mar 21, 2022



* fix bwd data filter 1x1, stride 2 bug

* change short to ck::bhalf_t

* reset input to zero
Co-authored-by: ltqin <letaoqin@amd.com>

b51808d7

11 Mar, 2022 1 commit

Use Space Filling Curve in Threadwise Copy (#118) · 9e33fe70

Jianfeng Yan authored Mar 11, 2022



* fixed a corner case in GetCoordinateResetStep

* clean

* rename num_accesses to num_access
Co-authored-by: Chao Liu <chao.liu2@amd.com>

9e33fe70

09 Mar, 2022 1 commit

Reorganize files, Part 1 (#119) · 5d37d7bf

Chao Liu authored Mar 08, 2022

* delete obsolete files

* move files

* build

* update cmake

* update cmake

* fix build

* reorg examples

* update cmake for example and test

5d37d7bf

05 Mar, 2022 1 commit

Fix Tests build (#109) · 5b178874

Chao Liu authored Mar 05, 2022

* fix tests

* remove useless file

* fix test build

* reduce parallelism when compiling

* fix test

5b178874

04 Mar, 2022 3 commits

[Bf16 & int8] [example & ckprofiler] (#100) · 7e9a9d32

rocking5566 authored Mar 05, 2022



* Add int8 of mk_nk_mn to the ckProfiler

* Add example of int8 gemm

* Fix typo, use ushort instead of half_t for bfloat16

* replace ushortXXX_t to bhalfXXX_t

* rename ushort to bhalf_t

* Add bf16 example

* Add bf16 gemm to ckProfiler

* Fix alignment

* Fix typo

* Add unit test for gemm_xdl int8

* Add gemm_xdl fp32 unit test

* Add gemm_xdl bf16 unit test

* fix build

* fix build issue due to merge conflict

* Fix build

* Fix build error
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

7e9a9d32
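
Several messages above describe storing bfloat16 values in a 16-bit unsigned integer type (`ushort`, later renamed `bhalf_t`). The sketch below shows why that works: a bfloat16 value is the upper 16 bits of the float32 encoding. The helper names are hypothetical, and the conversion here truncates rather than rounds, which is an assumption made for brevity and may not match the library's conversion.

```python
import struct


def float_to_bf16_bits(x):
    """Return the upper 16 bits of the float32 encoding of x (truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_float(b):
    """Expand 16 bfloat16 bits back to a float by zero-filling the low bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


one_bits = float_to_bf16_bits(1.0)
round_trip = bf16_bits_to_float(float_to_bf16_bits(3.14159))
```

The round trip keeps the float32 exponent range but only 7 explicit mantissa bits, so 3.14159 comes back as 3.140625.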

Refactor threadwise copy using sfcurve (#101) · 0619ebf7

Jianfeng Yan authored Mar 04, 2022



* add space_filling_curve

* cleanup and move space_filling_curve into test

* WIP: start refactoring threadwise_transfer_v1r3

* threadwise_copy works but needs further refactoring

* add some comments

* add SpaceFillingCurve::GetIndices()

* minor changes

* removed GetIndices; refactored GetDstCoordinateResetStep

* add DynamicBuffer::Transfer, but Add is not tested

* rebased against develop

* threadwise_copy_v6r1/v6r2/v6r3 using space-filling curve start to work

* minor changes

* refactored threadcopy v3r1, v2; removed old implementations

* clang-format

* cleanup

* fix a typo in v6r3

* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>

0619ebf7

NHWC conv 2d: bwd fp32/fp16/bfp16/int8, Device level tuning and host API (#92) · c254e5ab

ltqin authored Mar 04, 2022



* start conv2d bwd api

* kernel running

* add bwd reference

* change to no shuffle

* fix bwd reference

* pass verification

* add Filter1x1Stride1Pad0 and start testing

* change some tuning parameter

* fix test error

* add fp16 tuning parameter

* add bf16 tuning parameter

* add int8 tuning parameters

* change fp32 tuning parameter

* add bwd to profiler

* fix bug for bwd profiler

* fix ckProfiler bug

* change conv2d_bwd_xdl to fp16

* fix bug in comments

* fix precompile id

* fix enum conv name

* change _bwd_ to _bwd_data_

* change conv2d_bwd example id

* bwd to bwd data

* fix prehead

* fix MakeDefaultBlock2CTileMap, imported from merging develop

* format bwd instance

* bwd to bwd data

* change name bwd to bwd data

* change name bwd to bwd data in example

* format code

* change conv2d bwd data id in example

* rewrite readme for example

* fix CalculateMagicNumbers about div zero

* add workaround CK_WORKAROUND_SWDEV_325164

* change test_conf2d_bwd_data show info

* format

* fix bug for workaround:CK_WORKAROUND_SWDEV_325164

* format tuning parameters

* format tuning parameters again

* format tuning parameters 3

* format tuning parameters 4

* remove add function template

* format

* update comment
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

c254e5ab

03 Mar, 2022 1 commit

Update test CMakeLists to add new tests automatically and add Jenkins stage for tests (#88) · 992f71e3

JD authored Mar 03, 2022



* add docker file and make default target buildable

* add Jenkinsfile

* remove empty env block

* fix package stage

* remove render group from docker run

* clean up Jenkins file

* add cppcheck as dev dependency

* update cmake file

* Add profiler build stage

* add hip_version config file for reduction operator

* correct jenkins var name

* Build release instead of debug

* Update test CMakeLists.txt
reorg test dir
add test stage

* reduce compile threads to prevent compiler crash

* add optional debug stage, update second test

* remove old test target

* fix tests to return proper results and self review

* Fix package name and make test run without args

* change Dockerfile to use rocm4.3.1

* remove parallelism from build

* Lower parallelism
Co-authored-by: Chao Liu <chao.liu2@amd.com>

992f71e3

25 Feb, 2022 1 commit

Space filling curve (#96) · bdedf64b

Jianfeng Yan authored Feb 24, 2022

* add space_filling_curve

* cleanup and move space_filling_curve into test

* add functions for backward and forward step; hard coded results in unit test

* minor changes

bdedf64b
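
To give a feel for what a space-filling curve buys in a tile copy, the sketch below walks a small 2D grid in snake order, so that every access is one unit step away from the previous one. This is only an illustration in Python: `snake_order` and `forward_step` are hypothetical names, and whether the repository's curve uses this exact ordering is an assumption.

```python
def snake_order(rows, cols):
    """Visit a rows x cols grid so consecutive indices differ by one step."""
    order = []
    for r in range(rows):
        # Even rows go left to right, odd rows right to left.
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order


def forward_step(order, i):
    """Index delta needed to move from access i to access i + 1."""
    (r0, c0), (r1, c1) = order[i], order[i + 1]
    return (r1 - r0, c1 - c0)


order = snake_order(2, 3)
steps = [forward_step(order, i) for i in range(len(order) - 1)]
```

Because each step changes a single coordinate by one, the forward and backward steps mentioned in the commit message can be computed from the access index alone.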

23 Feb, 2022 2 commits

Unify Convolution FWD XDL 1D/2D implementation. (#93) · 756a7617

Adam Osewski authored Feb 23, 2022



* Convolution ND

* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances

* Move convnd f32 instance file to comply with repo structure.

* Conv 1D tensor layouts.

* Formatting and use ReferenceConv

* Reference ConvFwd supporting 1D and 2D convolution.

* Debug printing TensorLayout name.

* Conv fwd 1D instance f32

* Refactor conv ND example.

Needed to support various conv dimensions

* Rename conv nd example directory to prevent conflicts.

* Refactor some common utility to single file.

Plus some tests.

* Refactor GetHostTensorDescriptor + UT.

* Add 1D test case.

* Test reference convolution 1d/2d

* Remove some leftovers.

* Fix convolution example error for 1D

* Refactor test check errors utility function.

* Test Conv2D Fwd XDL

* More UT for 1D case.

* Parameterize input & weight initializers.

* Rename example to prevent conflicts.

* Split convnd instance into separate files for 1d/2d

* Address review comments.

* Fix data type for flops/gbytes calculations.

* Assign example number 11.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

756a7617

Conv3d new (#94) · 6dfb92bb

Jianfeng Yan authored Feb 22, 2022



* conv3d compiles but has memory error

* conv3d works

* fix performance issue by using __builtin_amdgcn_readfirstlane

* change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*

* clang-format

* remove CK_EXPERIMENTAL_PASS_TENSOR_DECRIPTOR_BY_*; moved wrapper into DeviceConv3d

* format

* remove useless macro

* add comment
Co-authored-by: Chao Liu <chao.liu2@amd.com>

6dfb92bb

12 Feb, 2022 1 commit

NHWC conv 2d: fwd bfp16/int8, Device level tuning and host API (#73) · 880fbee9

ltqin authored Feb 12, 2022



* add fwd bf16 conv

* change tuning parameter

* add int8 for conv fwd

* remove comments

* change tuning parameter for int8

* change init int8 example

* add test for conv2d fwd

* change device operation file position because of merging develop

* fwd int8 use reference

* test_conv_fwd use reference

* add bracket for if statement

* rename fwd example name

* remove StaticBufferOfVectorTypeV2

* tweak example
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

880fbee9

03 Feb, 2022 1 commit

add split-k GEMM (#59) · 4be7f019

ltqin authored Feb 03, 2022



* add DeviceGemmSplitKXdl

* add file device_gemm_splitk_xdl.hpp

* set c matrix zero

* using atomic

* add all tuning parameter to f32 mkkn

* grid size change to 720

* add tuning parameter for NT

* add tuning parameter for TN

* add tuning parameter for TT

* add m=96 tuning parameter

* add lost config

* add element wise operation

* fixed MPerBlock=96

* remove macro for splitk switch

* add test

* add new line at the end of device_gemm_xdl_instance.hpp

* remove step hack

* separate split-k instance files

* add tuning parameters

* change desired grid size to parameters

* remove slice length

* add desiredgridsize parameter to ckProfiler

* add missing file device_gemm_xdl_splitk_instance.hpp

* change desired grid size to kbatch

* format

* format

* clean up

* add selection of device_instances

* clean code

* fix build issue
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

4be7f019
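
The messages in this commit ("set c matrix zero", "using atomic", "change desired grid size to kbatch") fit the usual split-K scheme: the K dimension is cut into `kbatch` slices, each slice produces a partial product, and the partials are accumulated into an output that starts at zero. The Python sketch below illustrates that scheme only; `splitk_gemm` is a hypothetical name, and the plain `+=` stands in for the atomic add a device kernel would need.

```python
def splitk_gemm(a, b, m, n, k, kbatch):
    """Compute C = A * B by summing kbatch partial products over slices of K."""
    c = [[0] * n for _ in range(m)]        # C must start at zero
    chunk = (k + kbatch - 1) // kbatch     # slice length, rounded up
    for batch in range(kbatch):
        k_begin = batch * chunk
        k_end = min(k_begin + chunk, k)    # last slice may be short or empty
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][p] * b[p][j]
                              for p in range(k_begin, k_end))
                c[i][j] += partial         # stands in for an atomic add
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
c_split = splitk_gemm(a, b, m=2, n=2, k=4, kbatch=2)
c_plain = splitk_gemm(a, b, m=2, n=2, k=4, kbatch=1)
```

Splitting K adds parallelism when M and N alone give too few output tiles to fill the device, at the cost of the extra zeroing pass and the accumulation.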

30 Nov, 2021 1 commit

added test for magic number division (#58) · 237d4ca0

Chao Liu authored Nov 30, 2021

237d4ca0
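
Magic number division replaces an integer division by a fixed divisor with a multiply, an add, and a shift, which is cheaper on a GPU. The sketch below shows one common formulation for 32-bit unsigned values in Python. The function names are hypothetical, and it is an assumption that the tested implementation uses this exact variant. The restriction of the dividend to 31 bits keeps the intermediate sum inside 32 bits.

```python
def calculate_magic_numbers(divisor):
    """Precompute (multiplier, shift) for a fixed divisor in [1, 2**31]."""
    assert 1 <= divisor <= 2**31
    shift = 0
    while (1 << shift) < divisor:          # smallest shift with 2**shift >= divisor
        shift += 1
    multiplier = ((1 << 32) * ((1 << shift) - divisor)) // divisor + 1
    return multiplier, shift


def magic_divide(dividend, multiplier, shift):
    """Return dividend // divisor using only multiply, add, and shift."""
    assert 0 <= dividend < 2**31
    high = (dividend * multiplier) >> 32   # upper 32 bits of the 64-bit product
    return ((high + dividend) & 0xFFFFFFFF) >> shift


multiplier, shift = calculate_magic_numbers(7)
quotient = magic_divide(100, multiplier, shift)
```

A test for this kind of routine typically compares the result against ordinary integer division over a range of divisors and dividends, including the boundaries.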