Commits · 74495491b4efd78268583926d682c8b0e3c9cc4c · gaoqiong / composable_kernel

30 May, 2022 9 commits
- clang-format · 74495491
  rocking authored May 30, 2022
  
  74495491
- [What] Refine perf evaluation in example of gemm + reduction · 9ed2de0b
  rocking authored May 30, 2022
```
[Why] Perf evaluation of gemm + reduction may cause verification to fail, because evaluation does not initialize global memory
```
  9ed2de0b
- Fix compile error · 086625dc
  rocking authored May 30, 2022
  
  086625dc
- Merge branch 'develop' into gemm_norm · f41d5a63
  rocking authored May 30, 2022
  
  f41d5a63
- Evaluate perf of the kernel · 40bcfcde
  rocking authored May 30, 2022
  
  40bcfcde
- Refine class name · 9402ee4b
  rocking authored May 30, 2022
  
  9402ee4b
- Refine folder name · af5b9da7
  rocking authored May 30, 2022
  
  af5b9da7
- [What] Support non-pointer for invoker and argument · da8c0608
  rocking authored May 30, 2022
```
[Why] Sync coding style with gemm
```
  da8c0608
- Refine naming · 2fc2a189
  rocking authored May 30, 2022
  
  2fc2a189
27 May, 2022 6 commits
- Fix compile error · 3bd10f4e
  rocking authored May 27, 2022
  
  3bd10f4e
- Fixing conv bug (#258) · 91d8b7d6
  Chao Liu authored May 27, 2022
```
* debugging conv

* fix oversight where ctile map is constructed before initializing c desc

* example program should return error code

* clean up

* changed Block2CTileMap in conv2d and convnd

* clean up

* clean up

* cleanup
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
```
  91d8b7d6
- Fix typo · c06d421f
  rocking authored May 27, 2022
  
  c06d421f
- Fix compiler error due to merge from develop · e745d545
  rocking authored May 27, 2022
  
  e745d545
- Merge commit '3e6c2610' into gemm_norm · b2290854
  rocking authored May 27, 2022
  
  b2290854
- layerNorm verification · 253f7ef2
  rocking authored May 27, 2022
  
  253f7ef2
26 May, 2022 3 commits

Add FP64 XDL GEMM built-in function (#199) · 3e6c2610

ltqin authored May 27, 2022



* add intrin_mfma_f64_16x16x4f64

* add example

* gemm reference add double data type

* change init data

* fix M N PerXdlops

* fix ifdef

* add comparison config

* add conv fwd example

* format log out

* change rc matrix register layout

* reorganize example

* reorganize example 2

* format, because of merging develop

* fix call impl adding acc data type

* lost ;

* add compiler warning

* change example tuning parameters

* add test for fp64

* add instance

* add test/gemm/gemm_fp64.cpp

* fix get name issue

* remove some tuning parameter

* fix conflict

* format

* use integer value for GEMM test

* add acc data type

* remove typeid because fp16

* fix streamconfig etc bug from merging develop

* format

* remove test_gemm_xdl_fp64

* add AccDataType

* AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>

3e6c2610

Add layernorm example · 667d7f0f
rocking authored May 26, 2022

667d7f0f

Add pooling example (#257) · 97c4d486

Qianfeng authored May 26, 2022

* Add example for computing LayerNorm mean and meansquare

* Refactor the pool2d_fwd example and add example for float type testing

* Revert "Add example for computing LayerNorm mean and meansquare"

This reverts commit df52e6f9d897b00c981baa48f291450bcd60925d.

* Tiny fix in pool2d_fwd_common.hpp

97c4d486

25 May, 2022 6 commits

Add 5ary elementwise for normalization · bd34d666
rocking authored May 25, 2022

bd34d666

Hotfix binary elementwise (for broadcast on fastest axis) (#254) · 82d7d993

rocking5566 authored May 26, 2022



* Support different length of ScalarPerVector

* Add example of broadcast on fastest axis

* Typo

* Refine fastest example

* Add dimension check

* Modify fastest broadcast example to 3d

* Enforce that users give scalarPerVector explicitly

* 1. Add CScalarPerVector
2. Broadcast on the fastest axis is not the only case that needs scalarPerVector set to 1

* Rename var

* Move IsScalarPerVectorValid() inside IsSupportedArgument()

* Separate GridDesc_M0 into A, B and C

* rename var

* Rename var of length
Co-authored-by: rocking <chunylai@amd.com>

82d7d993

Refine deviceop · 980ed33a
rocking authored May 25, 2022

980ed33a
Remove epsilon · bb314592
rocking authored May 25, 2022

bb314592

Tensile-style block to C tile map (#239) · e579c9e5

Anthony Chang authored May 25, 2022

* fix build

* Revert "fix build"

This reverts commit d7310238.

* post PR #235 merge fix

* amend

* adds tensile-style c-tile map

* make it dynamic version

* add k-split flavor tile map

* apply tensile-style tile map to all xdl gridwise gemms

* remove dead code
Co-authored-by: Chao Liu <chao.liu2@amd.com>

e579c9e5

minor fix for recent PR (#255) · 61851ae2
Chao Liu authored May 24, 2022
```
* minor fix

* clean
```
61851ae2

24 May, 2022 6 commits

Navi21 gemm (#197) · 40b59a63

Jianfeng Yan authored May 24, 2022



* start adding navi21 GEMM

* navi_gemm_km_kn_mn_fp32 compiles and passes one test.

* rename variables and functions in gridwise_gemm_dlops_v1r3

* add other 3 layouts; format instance

* adding more tuning parameters

add tuning parameters for other 3 layouts

* add gemm_dlops_f16

* tmp

* add dependence of DeviceGemm::IsSupportedArg() on arch

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* minor changes

* push gemm_dlops into profiler

* minor changes

* if using xdl or dlops is moved into profiler_gemm_impl

* minor changes

* minor changes

* remove is_xdl from profile_gemm_impl

* make IsSupportedArg dependent on arch for other device_gemm

* minor changes

* minor changes

* fix a bug in f_generate_tensor_value

* add 64x64x64 for gemm_dlops_int8

* add 64x64x64 for gemm_dlops_int8

* comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1

* fix

* start fixing tuning parameters

* minor

* minor changes

* minor changes

* minor changes

* fixing

* adding example

* adding example

* adding example

* add gemm fp32 example

* clean up

* use 128x128x16 as MNK tile in navi21 gemm example

* bug fix

* fix test

* use new block c tile

* clean

* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>

40b59a63

Overhaul to Reduction and its dependants (#237) · 63eee2d9

Qianfeng authored May 25, 2022

* Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type

* Update to host layer and host reduction

* Merge and remove reduction kernels

* Merge and remove reduction device interfaces and update pooling device interface

* Merge and remove useless reduction device instances

* Update to reduction profiler and reduction ctests

* Update to reduction and pooling examples and add one reduction example

* Change reduction examples to make them testable by ctest

* Add explicit pass checking for reduction and pooling examples

* Explicit assignment of tensor shapes in example reduce_blockwise_two_call

* Use atomic_add to replace atomicAdd and add atomic_add for double type

* Add reduce ctest support for double data type

* Replace to_int_vector() by using c++ std::vector::assign()

* Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise

* Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock

* Add GetAtomicOperationZeroValue() support for AtomicMax

* Tiny change to reduce example README.md

* Fix some tiny issues due to branch merging

* Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t

* Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64

* Renaming

* Clean up the header includes in device_reduce instance header files

63eee2d9

Add performance tests as a stage of CI. (#247) · 1085794d

Illia Silin authored May 24, 2022

* modify ckProfiler_gemm output

* fix syntax

* change ckProfiler output and return 0

* fix syntax

* output datatype

* fix syntax

* output datatype in another way

* fix syntax

* fix syntax

* test return values of ckProfiler

* add layout info and tests, make sure ckprofiler returns 0

* fix syntax

* change layout output

* fix syntax

* fix syntax again

* update script to process perf results

* rearrange jenkins stages

* fix typo

* add python packages to Docker file

* adding setuptools-rust package

* modify parsing for new test parameters

* test db credentials on jenkins

* fix syntax

* update python script to handle incomplete lines

* upgrade python to 3.8 and write the gemm_params table

* add sqlalchemy package to docker

* move perf data processing to master node

* move the master node inside a steps region

* add new stage for result processing

* move results processing to separate stage

* reduce number of tests to speedup debugging

* pass config to processPerfResults stage

* run script on master in a docker container

* replace show_node_info

* try loading docker on master node again

* use ansible node instead of master

* get rid of pymysql package

* try ssh connection using paramiko

* put back pymysql

* put the perf data processing back on the gpu node

* put back artifact definition

* archive the perf_log before parsing

* clean up jenkinsfile, fix parsing

* fix typo

* enable all perf tests

* put all stages in original order, finalize script

* fix gpu_arch version

* update parsing script

* remove obsolete file causing merge conflict

1085794d

add GetWorkSpaceSize to base arg (#253) · 0d08cf18

Shaojie WANG authored May 25, 2022

* add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight

* remove redundant compute

* use datatype and split k to check whether a workspace is used

* remove unused computation for work space size

0d08cf18

Merge commit 'ba58a93f' into gemm_norm · 459f63a8
rocking authored May 24, 2022

459f63a8
Add normalize device op (invoker::run() not implemented) · 577417f4
rocking authored May 24, 2022

577417f4

23 May, 2022 5 commits
- fix build (#246) · ba58a93f
  Chao Liu authored May 23, 2022
```
* fix build

* Revert "fix build"

This reverts commit d7310238.

* post PR #235 merge fix

* amend
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
```
  ba58a93f
- Fix parameter name · aebdc4b2
  rocking authored May 23, 2022
  
  aebdc4b2
- Add reduce mean and square mean · 41f3973d
  rocking authored May 23, 2022
  
  41f3973d
- Refine file name · b42dbb47
  rocking authored May 23, 2022
  
  b42dbb47
- Implement reduction mean and reduction square mean · 44e87b4e
  rocking authored May 23, 2022
  
  44e87b4e
20 May, 2022 5 commits

example of conv bwd weight 1d/2d/3d fp32/fp16/bf16 xdl (#244) · ac543313

Shaojie WANG authored May 21, 2022



* enable example of conv 1d/3d for bwd weight

* make bf16 kernel not use atomic add

* using new gridwise gemm for bwd weight on convnd bwd weight
Co-authored-by: Chao Liu <chao.liu2@amd.com>

ac543313

remove options.hpp.in (#240) · 44943e0e
Chao Liu authored May 20, 2022

44943e0e

Refactor block to C tile map (#235) · a054f7d6

Anthony Chang authored May 21, 2022

* refactor block-to-ctile-map

* gridwise gemm block2ctile generic validity check

* format

* amend split-k gemm block2ctile map refactor

* add test

* format

* amend

* revert to calculating batch index in kernel instead of passing as block_id_z

* move file

* add valid ctile index check to gridwise v2r4

a054f7d6

[conv bwd-weight]Binding gemm k1 to conv n (#202) · 070619fb

Shaojie WANG authored May 21, 2022



* add some instance to develop

* avoid bank conflicts for wrw for all instance

* add small K1 test

* delete some unused instance

* binding gemm k1 to conv n

* try using half_4 to do ds_read

* reset buffer load oob and ds memcpy to default option

* remove useless instances

* remove redundant space

* remove printf code

* clang-format-10 change

* use fastest config

* fix clang format for the other files

* remove gemmk0 pad for output

* add gemmk padding macro

* add bank length computation

* add template to distinguish the instance that need lds padding for wrw

* use rocm5.1 as docker

* use integer value for GEMM test

* add Right padding macro

* add 2 test asm code

* using 256x256x32 tile size

* 1. move dedicated transform into gridwisegemm's header file. 2. make lds tensor params a struct template. 3. remove useless code

* using small vec

* 256*128 kernel size for example

* remove asm files

* use a new gridwise gemm header for bwd-weight

* revert gridwise gemm v2r4r2

* change format

* reset gridwise gemm v2r4r2

* remove unused code

* revert instance file

* revert example instance

* format file

* remove macros

* resolve compile error

* rename wrw kernel invoker

* use gridwisegemm pipeline struct instead of implementing run function in the same header
Co-authored-by: Chao Liu <chao.liu2@amd.com>

070619fb

remove unused conv bwd data profiler header and cpp (#245) · b31b588d
Shaojie WANG authored May 21, 2022

b31b588d