Commits · 10732847e73496e59f398d894c50dd9a920f1bd4 · gaoqiong / composable_kernel

21 Jul, 2023 1 commit
- Grouped conv bwd wei NDHWGC/NDHWGK (#804) · 10732847
  Bartłomiej Kocot authored Jul 21, 2023
  
18 Jul, 2023 3 commits

Grouped 3d conv backward data support (#799) · 49180fd6
Bartłomiej Kocot authored Jul 18, 2023
```
* Grouped 3d conv backward data support

* Fix comments
```
Remove type_convert bf16 to int32 and back (#802) · f82bd593
Rostyslav Geyyer authored Jul 18, 2023


Add mechanism to build CK for select data types, add Navi3x CI. (#790) · 189ea3b9

Illia Silin authored Jul 17, 2023

* allow building CK for specific data types

* add CI build and test stage on Navi3x without some int8 instances

* add missing gemm fp16 instances

* add the changes to the missed cmake file

* add empty lines at end of source files

* Do not build quantization client example on navi3 in CI

* disable batched_gemm_multi_d_int8 instances with DTYPES

* disable device_conv2d_bwd_data_instance with DTYPES

* fix ckprofiler for conv_bwd_data for int8

* properly isolate the conv_bwd_data int8 instances

* remove empty line


17 Jul, 2023 1 commit

Add check for compiler GPU target support. (#800) · 4867db42

Illia Silin authored Jul 17, 2023

* check if gpu_targets are supported by compiler

* set default list of targets and filter for them


15 Jul, 2023 1 commit
- Disable Werror to ignore xnack+ warnings (#794) · 03d3395b
  arvindcheru authored Jul 14, 2023
```
* Disable Werror to ignore xnack+ warnings
```
12 Jul, 2023 1 commit

Support NHWGC conv2d_bwd_weight (#769) · 1ee99dca

Bartłomiej Kocot authored Jul 12, 2023



* Support NHWGC conv2d_bwd_weight

* Fix client example

* Fix client example

* Fix comments

* Redesign grouped_conv_bwd_weight instances

* Clang format fix

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>


07 Jul, 2023 1 commit
- change the build thread usage in CI (#787) · 87f2bbcf
  Illia Silin authored Jul 06, 2023
  
06 Jul, 2023 5 commits

Add basic setup for precommit (#749) (#764) · 237f9cd3

Adam Osewski authored Jul 06, 2023



* Add basic setup for precommit

* Update README.md with instructions on installing precommit hooks

---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>


Split GEMM instance library & enable pipeline v2 optimization (#783) · 850144a0

Po Yen Chen authored Jul 06, 2023

* Move source file into sub-directories

* Add missing include directive

* Split DeviceGemmXdl<> fp16 instances

* Fix format

* Remove unnecessary CMakeLists.txt

* Add macros to toggle new features

* Remove debug message

* Turn off GEMM v2 pipeline optimization by default

* Fix format

* Extract duplicated string as list

* Enlarge indent in CMakeLists.txt


Batchnorm splitk single kernel (#771) · 8f5cafaf

Qianfeng authored Jul 06, 2023

* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]

* Add CountDataType as template parameter in blockwise_welford

* Add utility/get_shift.hpp

* Add BatchNorm multiblock single-kernel implementation

* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a

* Renaming in device_batchnorm_forward_impl.hpp

* Tiny fix in the batchnorm_fwd profiler

* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"

This reverts commit d16d00919c43f10759e7b4e4d112125221ed9064.

* Use the old two-kernel batchnorm multiblock method for gfx1030

* Use the old two-kernel batchnorm multiblock method for gfx908

* use the single-kernel batchnorm multiblock method only for gfx90a

* Remove get_wave_id() from utility/get_id.hpp since it is not used

* Set true for testing running mean/variance and saving mean/invvariance in the examples

* Fix to copy-right words

* Remove un-needed including in utility/get_id.hpp

* Add comments to workgroup_synchronization.hpp

* Remove un-used codes in gridwise_multiblock_batchnorm_forward.hpp

* Renaming in the kernels

* Remove un-used kernel file


Move Device Ops implementations into impl directory. (#777) · f4dfc060
Adam Osewski authored Jul 06, 2023
```
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
```
Fix copyrights for DeviceBatchedGemmMultipleD_Dl · 2b0b6d9f
Bartlomiej Kocot authored Jul 05, 2023


05 Jul, 2023 2 commits
- Add the missing archs (#785) · 61dc9aa9
  Rostyslav Geyyer authored Jul 05, 2023
  
- Add fp8 GEMM and an example for it (#767) · 1cf50031
  Rostyslav Geyyer authored Jul 04, 2023
```
* Add fp8 xdl gemm

* Add example

* Use int8 intrinsics for buffer load/store

* Format

* Update cmakelists
```
30 Jun, 2023 1 commit
- Upgrade default docker to ROCM5.6 release. (#778) · 7797bd3d
  Illia Silin authored Jun 30, 2023
```
* upgrade default compiler to rocm5.6 release

* do daily runs with rocm5.6 instead of 5.5
```
28 Jun, 2023 1 commit
- Add rocm5.6 RC4 and rocm5.7 to docker build options. (#770) · d3adc665
  Illia Silin authored Jun 28, 2023
```
* upgrade to rocm5.6 rc4

* add rocm5.7 docker
```
21 Jun, 2023 2 commits
- do not build gfx941/942 targets during CI (#766) · 3b18f1e3
  Illia Silin authored Jun 21, 2023
  
- Support bf16/f32/f16 and NHWGC conv2d_bwd_data (#757) · 63388e84
  Bartłomiej Kocot authored Jun 21, 2023
```
* Support bf16/f32/f16 and NHWGC conv2d_bwd_data

* Add interface test

* clang format

* Comment fixes

* Add more friendly error message
```
20 Jun, 2023 2 commits
- remove useless comments (#760) · 32d2f52b
  ltqin authored Jun 20, 2023
  
- changed pipeline v1 (#763) · 05ea6452
  zjing14 authored Jun 19, 2023
  
19 Jun, 2023 3 commits

do not build gemm-gemm and conv-conv examples for gfx94* (#761) · 645eb2f2

Illia Silin authored Jun 19, 2023

* do not build gemm-gemm and conv-conv examples for gfx94*

* do not build gemm-gemm and conv-conv examples on navi


FP8 enablement - add a pseudorandom number generator, add conversion methods (#708) · f0c620c4

Rostyslav Geyyer authored Jun 19, 2023

* Add basic fp8 definitions and prn-generator

* Format

* Add fp8<->fp32 type_convert

* Format

* Split type_convert and cast_to/from_f8

* Format

* Minor fix

* Minor fix

* Move fp8 utils to a separate header

* Add elementwise ops

* Add fp8_convert_sr

* Format

* Add element op

* Eliminate magic numbers

* Split f8_convert_sr in host and device

* Format

* Add some constexpr

* Add a datatype test

* Format

* Another format

* Add fp8<->fp16 tests

* Update type_converts

* Format

* Add fp16 casting functions

* Format

* Use seed as a runtime arg

* Use element location for PRNG

* Format

* Add fp8<->fp16 to PassThrough element op

* Clean up

* Merge host and device implementations

* Add comments on rounding modes

* Remove leftover code

* Put type_converts into a separate header

* Put random number gen to a separate header

* Rearrange f8_utils' namespaces

* Refactor type_convert.hpp

* Move f8_t definition


Maxpool bwd (#750) · 341ad956

rocking authored Jun 19, 2023

* Add maxpool f32 kernel and example

* Revise copyright

* Add device pool bwd device op

* Support f16 and bf16

* Add compute datatype for reference code.
Prevent error in bf16

* Fix type error

* Remove layout

* Fix bf16 error

* Add f16 and bf16 example

* Add more operations

* Implement IsSupportedArgument

* Add changelog

* Add comment

* Add comment

* Remove useless header

* Move initialize of workspace to the run

* Move set din zero to the device operator

* Save din_length_raw

* Remove useless header

* Calculate gridsize according to the number of CU

* Calculate gridSize according to the number of CU.
Remove useless header

* Add put example

* Remove useless header

* Fix CI fail


17 Jun, 2023 1 commit

Padded Generic Kernel Instance (#730) · 0d911822

Qianfeng authored Jun 17, 2023



* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting

* Move the generic kernel instance to be the first of the instance list for elementwise op of normalization

* Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax

* Add testing of GetGenericInstance() in client_example of Softmax

* Revert "Add testing of GetGenericInstance() in client_example of Softmax"

This reverts commit f629cd9a93ce38dfed4886d849f3c38d2e5379c8.

* Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"

This reverts commit a9f0d000eb9fd240404112a526ef125429a351df.

* Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm

* Move generic kernel instance to separate tuple for elementwise op of normalization

* Remove un-used files for softmax instance

* Store generic kernel instance to separate tuple for softmax

* Add IsSupported checking for generic instance to client example of softmax

* Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization

* clang-format fix

* Remove int8 from softmax instances

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>


16 Jun, 2023 1 commit
- do not build gfx941/942 targets during daily QA runs (#758) · d140bdc9
  Illia Silin authored Jun 16, 2023
  
15 Jun, 2023 3 commits

Enable gfx941 and gfx942 architectures. (#752) · 027e46ee

Illia Silin authored Jun 15, 2023

* enable gfx941/942 targets

* fix clang format

* fix the cmake logic for multiple targets

* fix cmake syntax for looping over targets

* add gfx941/942 support for gemm_xdl instances


Fixed Weight layout of grouped_conv 3d fwd (#743) · 309b1c64

zjing14 authored Jun 15, 2023



* Changed wei layout

* changed layout for examples

* fixed client example

---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>


Using number of compute units to set gridSize (#754) · c5f6ec84

Qianfeng authored Jun 15, 2023

* Add getAvailableComputeUnitCount() interface

* Use available number of compute units to set kernel grid size


14 Jun, 2023 2 commits

Fix the daily CI job with latest staging compiler. (#753) · d1838d32
Illia Silin authored Jun 14, 2023
```
* fix CI builds with latest staging compiler

* remove mount flags from dockerfile
```

Add generic kernel instances for ck::tensor_operation::device::DeviceGemmMultipleD (#741) · 54b68eb3

Rostyslav Geyyer authored Jun 14, 2023

* Add generic instance gemm_add_add_fastgelu

* Add a client example for generic gemm_add_add_fastgelu

* Update CMakeLists

* Format

* Format

* Add generic instance gemm_add_fastgelu

* Format

* Add a gemm_add_fastgelu client example

* Format

* Add generic instance gemm_fastgelu

* Format

* Fix argument order

* Add gemm_fastgelu client example

* Add exceptions if argument is not supported


12 Jun, 2023 4 commits

Fix arg order (#751) · a35456a3
Rostyslav Geyyer authored Jun 12, 2023


Add DeviceBatchedGemmMultipleD_Dl (#732) · fc9f9756

Bartłomiej Kocot authored Jun 12, 2023

* Add DeviceBatchedGemmMultipleD_Dl

* Fix batched_gemm tests

* Fix comments

* test_batched_gemm_multi_d fixes

* Fix args for isSupported batchedGemmMultipleDDl

* Disable tests for gfx90a


Fix incomplete object size (=4n + 3) support of amd_wave_read_first_lane() (#738) · 7c24654c

Po Yen Chen authored Jun 12, 2023

* Fix wrong pointer type

* Rename type trait get_unsigned_int<> to get_carrier<>

* Add 3-bytes carrier type

* Add missing __device__ specifier

* Rename template non-type parameter

* Leave the rest byte uninitialized

* Avoid invoking (host) STL algorithms

* Remove unnecessary 'inline' specifier

* Extract common logic out as helper method

* Hide dummy member function

* Add missing __device__ specifier


Fix flash attn mask bug (#733) · 0ede66de

ltqin authored Jun 12, 2023



* add check input parameter

* add instance for vector load = 1

* move general instance to first pos

* fix read bias code

* regular code for bias load

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>


08 Jun, 2023 1 commit
- support dynamic buffer using memory coherence glc_slc bit from template (#725) · 016ebaa7
  carlushuang authored Jun 08, 2023
  
07 Jun, 2023 1 commit

Update docker (#744) · 1dd455d6

Illia Silin authored Jun 07, 2023

* update dockerfile to build rocm5.6 rc3

* fix couple of docker issues


02 Jun, 2023 1 commit
- fix clang format (#740) · 40365904
  Illia Silin authored Jun 02, 2023
  
01 Jun, 2023 2 commits

replace hipMemcpy with hipMemcpyWithStream (#734) · e2ebc8e7
who who who authored Jun 02, 2023


Simplify kernel argument of device operator Device(Batched)GemmXdl<> (#723) · 9eae73df

Po Yen Chen authored Jun 02, 2023



* Remove M/N/KPad local variables

* Use M/N/KPad to name padded lengths

* Replace duplicated local variable by parameters

* Rename variables M/N/KRaw to M/N/K

* Move AK0/BK0 compute logic into GridwiseGemm

* Use macro to shorten code

* Move CalculateGridSize() logic into GridwiseGemm

* Add comment to credit the implementation source

* Reuse the existing implementation

* Remove no-longer used data members

* Remove elementwise-op objects from interfaces

* Reserve kernel arg as whole object in interfaces

* Remove redundant data member

* Make 3rd type parameter optional

* Remove unnecessary type parameters

* Remove no-longer used descriptor-creation methods

* Move kernel arg type definition into GridwiseGemm

* Add macro to switch between code sections

* Move argument field computing logic into device op side

* Make utility method 'static'

* Declare special methods

* Unify MakeArgument() usage

* Adapt the new GridwiseGemm interface

* Push-down class 'GridwiseGemm::Argument' fields

* Remove no-longer used methods

* Add unused parameters

* Force copying parameters in 'Embed' ctor

* Remove no-longer used descriptors

* Fallback change on BaseArgument

* Remove macro 'INTEGER_DIVIDE_CEIL'

* Make variable naming more consistent

* Make sure methods are only invoked on right place

* Remove trailing underscore in public attribute name

* Remove necessary methods

* Hide computing logic of derived attributes

* Make new 'Embed' ctor only available for device code

* Make sure 'Embed' type args are not references

* Move check for karg.K into CheckValidity()

* Remove more integer division logic from device code

* Undo changes on Embed

* Separate 'Problem' concept out from 'Argument'

* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigned char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to function definition

* Share same name for kernel interfaces

* Fix wrong boundary calculation logic

* Leave the third template arg for compatibility

* Remove unnecessary parameters

* Fix wrong error message (for type name)

* Create descriptor on device side

* Fix wrong debug message

* Remove no-longer used data members

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Reject unsupported argument

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlane.hpp as amd_wave_read_first_lane.hpp

* Update function calls

* Reorder statements

* Re-format files

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
