Commits · b2d5cf8a78df291d40354e8a216168ea5461eca1 · gaoqiong / composable_kernel

08 Aug, 2023 1 commit
- GQA-4 example · b2d5cf8a
  aska-0096 authored Aug 08, 2023
  
  b2d5cf8a
07 Aug, 2023 1 commit
- MQA implementation · 73e475d8
  aska-0096 authored Aug 07, 2023
  
  73e475d8
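
The two commit titles above appear to refer to multi-query attention (MQA) and grouped-query attention (GQA). That reading is an assumption, as is taking "GQA-4" to mean four query heads per key/value head. Under those assumptions, a minimal sketch of how a query head could be mapped to the key/value head it shares looks like this. The function name and parameters are hypothetical and are not taken from the repository.

```cpp
// Hypothetical helper, not the repository's API.
// Maps a query head index to the key/value head it shares.
// Assumes num_q_heads is a positive multiple of num_kv_heads.
constexpr int kv_head_for_query_head(int q_head, int num_q_heads, int num_kv_heads)
{
    // Number of query heads that share one key/value head.
    const int group_size = num_q_heads / num_kv_heads;
    return q_head / group_size;
}

// MQA as a special case: a single key/value head serves every query head.
static_assert(kv_head_for_query_head(7, 8, 1) == 0, "MQA maps all heads to KV head 0");
```

With 8 query heads and 2 key/value heads (groups of 4), query heads 0 to 3 map to key/value head 0 and heads 4 to 7 map to head 1.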
03 Aug, 2023 3 commits
- Merge pull request #827 from ROCmSoftwarePlatform/fpAintB_clear · c3cba9c9
  Haocong WANG authored Aug 03, 2023
```
fpAintB_gemm implementation
```
  c3cba9c9
- Fp16AInt8B_GEMM sanity · b5083bfe
  aska-0096 authored Aug 03, 2023
  
  b5083bfe
- debug code enabled · 5cf73a5e
  aska-0096 authored Aug 03, 2023
  
  5cf73a5e
01 Aug, 2023 1 commit
- Temp save · 32bac6f3
  aska-0096 authored Aug 01, 2023
  
  32bac6f3
28 Jul, 2023 1 commit
- Sanity pass. · 66e61076
  aska-0096 authored Jul 28, 2023
  
  66e61076
25 Jul, 2023 1 commit
- fpAintB kernel compile pass · 0c51a35e
  aska-0096 authored Jul 25, 2023
  
  0c51a35e
20 Jul, 2023 1 commit
- Temp save · febd76e4
  aska-0096 authored Jul 20, 2023
  
  febd76e4
07 Jul, 2023 1 commit
- add gemm fp16 instances · 6e6c5355
  aska-0096 authored Jul 07, 2023
  
  6e6c5355
26 Jun, 2023 1 commit
- clang format · 1fb4a474
  aska-0096 authored Jun 26, 2023
  
  1fb4a474
25 Jun, 2023 1 commit
- fix gemm · fd9e80c5
  aska-0096 authored Jun 25, 2023
  
  fd9e80c5
20 Jun, 2023 2 commits
- separate array base and vector base attention tensor transformation · 8053bca3
  aska-0096 authored Jun 20, 2023
  
  8053bca3
- API fix of gridwisegemmpipeline · b3770638
  aska-0096 authored Jun 20, 2023
  
  b3770638
19 Jun, 2023 3 commits
- clang format · 35e5c532
  aska-0096 authored Jun 19, 2023
  
  35e5c532
- part2 of previous commit · b010b095
  aska-0096 authored Jun 19, 2023
  
  b010b095
- Fix errors in · 43777959
  aska-0096 authored Jun 19, 2023
```
1. example, fmha
2. gridwise pipeline
3. deviceop, fmha, change some containers from vector to array
```
  43777959
15 Jun, 2023 1 commit
- clang-format · 83d926dc
  aska-0096 authored Jun 15, 2023
  
  83d926dc
13 Jun, 2023 5 commits
- Bug fix: double lds skip · 6c1aa33a
  aska-0096 authored Jun 13, 2023
  
  6c1aa33a
- deprecate inline asm wmma · d44f6660
  aska-0096 authored Jun 13, 2023
  
  d44f6660
- Merge branch 'e2e_kernellib' of https://github.com/aska-0096/navi3x_ck into e2e_kernellib · 823c8801
  aska-0096 authored Jun 13, 2023
  
  823c8801
- Merge branch 'develop' of... · e305e41e
  aska-0096 authored Jun 13, 2023
```
Merge branch 'develop' of https://github.com/ROCmSoftwarePlatform/composable_kernel into e2e_kernellib
```
  e305e41e
- AIT Attention API refactor (#8) · efee4541
  Haocong WANG authored Jun 13, 2023
```
* sanity pass

* sanity pass 2

* confirm significant performance regression.

* turn on all instances

* turn off instance format

* Fix bug & tuning & format

* DML meta, self_attn+cross_attn

* sanity pass

* remove useless flag

* update tile and problem size used in AIT attention

* bug fix in grouped conv supporting check
```
  efee4541
12 Jun, 2023 4 commits

Fix arg order (#751) · a35456a3
Rostyslav Geyyer authored Jun 12, 2023

a35456a3

Add DeviceBatchedGemmMultipleD_Dl (#732) · fc9f9756

Bartłomiej Kocot authored Jun 12, 2023

* Add DeviceBatchedGemmMultipleD_Dl

* Fix batched_gemm tests

* Fix comments

* test_batched_gemm_multi_d fixes

* Fix args for isSupported batchedGemmMultipleDDl

* Disable tests for gfx90a

fc9f9756

Fix incomplete object size (=4n + 3) support of amd_wave_read_first_lane() (#738) · 7c24654c

Po Yen Chen authored Jun 12, 2023

* Fix wrong pointer type

* Rename type trait get_unsigned_int<> to get_carrier<>

* Add 3-byte carrier type

* Add missing __device__ specifier

* Rename template non-type parameter

* Leave the remaining byte uninitialized

* Avoid invoking (host) STL algorithms

* Remove unnecessary 'inline' specifier

* Extract common logic out as helper method

* Hide dummy member function

* Add missing __device__ specifier

7c24654c
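
The commit above concerns objects whose size is of the form 4n + 3 bytes. As a rough host-side sketch of the general idea, and not the repository's implementation, an object can be passed through a 32-bit operation word by word, with the 1-, 2-, or 3-byte tail packed into one extra word. `passthrough32` is a hypothetical stand-in (here the identity) for whatever 32-bit operation is applied, and zero-filling the unused tail bytes is a choice made for this sketch only.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a 32-bit operation; here it is just the identity.
inline std::uint32_t passthrough32(std::uint32_t v) { return v; }

// Copy `src` by routing its bytes through 32-bit words.
// A tail of sizeof(T) % 4 bytes (1, 2, or 3) is packed into one extra word,
// so sizes of the form 4n + 3 are handled as well.
template <typename T>
T copy_via_words(const T& src)
{
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    constexpr std::size_t full_words = sizeof(T) / 4;
    constexpr std::size_t tail_bytes = sizeof(T) % 4;

    T dst{};
    const unsigned char* in = reinterpret_cast<const unsigned char*>(&src);
    unsigned char* out      = reinterpret_cast<unsigned char*>(&dst);

    for(std::size_t i = 0; i < full_words; ++i)
    {
        std::uint32_t word;
        std::memcpy(&word, in + 4 * i, 4);
        word = passthrough32(word);
        std::memcpy(out + 4 * i, &word, 4);
    }

    if(tail_bytes != 0)
    {
        std::uint32_t word = 0; // unused bytes are zero in this sketch
        std::memcpy(&word, in + 4 * full_words, tail_bytes);
        word = passthrough32(word);
        std::memcpy(out + 4 * full_words, &word, tail_bytes);
    }
    return dst;
}

// Example type whose size is 4 * 1 + 3 bytes.
struct Seven
{
    unsigned char bytes[7];
};
static_assert(sizeof(Seven) == 7, "Seven must occupy exactly 7 bytes");
```

Only the tail bytes are copied out of the last word, so nothing is written past the end of the destination object.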

Fix flash attn mask bug (#733) · 0ede66de

ltqin authored Jun 12, 2023



* add check input parameter

* add instance for vector load = 1

* move general instance to first position

* fix read bias code

* regular code for bias load

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

0ede66de

08 Jun, 2023 1 commit
- support dynamic buffer using memory coherence glc_slc bit from template (#725) · 016ebaa7
  carlushuang authored Jun 08, 2023
  
  016ebaa7
07 Jun, 2023 1 commit

Update docker (#744) · 1dd455d6

Illia Silin authored Jun 07, 2023

* update dockerfile to build rocm5.6 rc3

* fix a couple of docker issues

1dd455d6

02 Jun, 2023 1 commit
- fix clang format (#740) · 40365904
  Illia Silin authored Jun 02, 2023
  
  40365904
01 Jun, 2023 2 commits

replace hipMemcpy with hipMemcpyWithStream (#734) · e2ebc8e7
who who who authored Jun 02, 2023

e2ebc8e7

Simplify kernel argument of device operator Device(Batched)GemmXdl<> (#723) · 9eae73df

Po Yen Chen authored Jun 02, 2023



* Remove M/N/KPad local variables

* Use M/N/KPad to name padded lengths

* Replace duplicated local variable by parameters

* Rename variables M/N/KRaw to M/N/K

* Move AK0/BK0 compute logic into GridwiseGemm

* Use macro to shorten code

* Move CalculateGridSize() logic into GridwiseGemm

* Add comment to credit the implementation source

* Reuse the existing implementation

* Remove no-longer used data members

* Remove elementwise-op objects from interfaces

* Reserve kernel arg as whole object in interfaces

* Remove redundant data member

* Make 3rd type parameter optional

* Remove unnecessary type parameters

* Remove no-longer used descriptor-creation methods

* Move kernel arg type definition into GridwiseGemm

* Add macro to switch between code sections

* Move argument field computing logic into device op side

* Make utility method 'static'

* Declare special methods

* Unify MakeArgument() usage

* Adapt the new GridwiseGemm interface

* Push-down class 'GridwiseGemm::Argument' fields

* Remove no-longer used methods

* Add unused parameters

* Force copying parameters in 'Embed' ctor

* Remove no-longer used descriptors

* Fallback change on BaseArgument

* Remove macro 'INTEGER_DIVIDE_CEIL'

* Make variable naming more consistent

* Make sure methods are only invoked on right place

* Remove trailing underscore in public attribute name

* Remove unnecessary methods

* Hide computing logic of derived attributes

* Make new 'Embed' ctor only available for device code

* Make sure 'Embed' type args are not references

* Move check for karg.K into CheckValidity()

* Remove more integer division logic from device code

* Undo changes on Embed

* Separate 'Problem' concept out from 'Argument'

* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigned char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to function definition

* Share same name for kernel interfaces

* Fix wrong boundary calculation logic

* Leave the third template arg for compatibility

* Remove unnecessary parameters

* Fix wrong error message (for type name)

* Create descriptor on device side

* Fix wrong debug message

* Remove no-longer used data members

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Reject unsupported argument

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

* Update function calls

* Reorder statements

* Re-format files

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

9eae73df
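
Several bullets in the commit above deal with padded lengths (M/N/KPad) and with the integer ceiling-division macro that was removed. As a minimal sketch of the arithmetic behind such padding, and not the repository's code, a length can be rounded up to a multiple of a tile size as follows. The helper names and the tile size in the example are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: ceil(a / b) for non-negative a and positive b,
// computed with integer arithmetic only.
constexpr std::int64_t integer_divide_ceil(std::int64_t a, std::int64_t b)
{
    return (a + b - 1) / b;
}

// Hypothetical helper: round `length` up to the next multiple of `tile`.
constexpr std::int64_t pad_to_multiple(std::int64_t length, std::int64_t tile)
{
    return integer_divide_ceil(length, tile) * tile;
}

// A length of 1000 with a tile of 256 needs 4 tiles, i.e. a padded length of 1024.
static_assert(pad_to_multiple(1000, 256) == 1024, "1000 pads to 1024");
```

A length that is already a multiple of the tile size is returned unchanged.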

31 May, 2023 2 commits

update copyright headers (#726) · b94fd0b2
Illia Silin authored May 31, 2023

b94fd0b2

Add class type support for __builtin_amdgcn_readfirstlane() (#711) · 582e31e8

Po Yen Chen authored May 31, 2023

* Add overloaded version of __builtin_amdgcn_readfirstlane()

* Remove 'static' specifiers

* Remove more 'static' specifier

* Replace unsigned char by std::byte

* Add 'const' specifier to never changing variable

* Add 'inline' specifier to function definition

* Fix wrong boundary calculation logic

* Rename type trait

* Remove std:: qualifier from standard types

* Replace 'size_t' by 'unsigned'

* Use type alias to hint usage

* Replace static_for<> by ordinary 'for' loop

* Rename readfirstlane() to amd_wave_read_first_lane()

* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp

* Reorder statements

582e31e8

30 May, 2023 4 commits

fix wmma gemm int8; add grouped conv int8 example (#716) · 6eef0755
Haocong WANG authored May 30, 2023

6eef0755

Simplify kernel argument of device operator DeviceGemm_Xdl_CShuffle<> (#696) · 1344a0f2

Po Yen Chen authored May 30, 2023



* Remove M/N/KPad local variables

* Use M/N/KPad to name padded lengths

* Replace duplicated local variable by parameters

* Rename variables M/N/KRaw to M/N/K

* Move AK0/BK0 compute logic into GridwiseGemm

* Use macro to shorten code

* Move CalculateGridSize() logic into GridwiseGemm

* Add comment to credit the implementation source

* Reuse the existing implementation

* Remove no-longer used data members

* Remove elementwise-op objects from interfaces

* Reserve kernel arg as whole object in interfaces

* Remove redundant data member

* Make 3rd type parameter optional

* Remove unnecessary type parameters

* Remove no-longer used descriptor-creation methods

* Move kernel arg type definition into GridwiseGemm

* Add macro to switch between code sections

* Move argument field computing logic into device op side

* Make utility method 'static'

* Declare special methods

* Unify MakeArgument() usage

* Adapt the new GridwiseGemm interface

* Push-down class 'GridwiseGemm::Argument' fields

* Remove no-longer used methods

* Add unused parameters

* Force copying parameters in 'Embed' ctor

* Remove no-longer used descriptors

* Fallback change on BaseArgument

* Remove macro 'INTEGER_DIVIDE_CEIL'

* Make variable naming more consistent

* Make sure methods are only invoked on right place

* Remove trailing underscore in public attribute name

* Remove unnecessary methods

* Hide computing logic of derived attributes

* Make new 'Embed' ctor only available for device code

* Make sure 'Embed' type args are not references

* Move check for karg.K into CheckValidity()

* Remove more integer division logic from device code

* Undo changes on Embed

* Separate 'Problem' concept out from 'Argument'

* Share same name for kernel interfaces

* Reject unsupported argument

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

1344a0f2

Multiple fixes to GroupedGemm+SplitK (#707) · 70e4eb56

Adam Osewski authored May 30, 2023



* Add license header.

* Reduce the amount of logged output. Add constant initialization.

* Add functional tests for grouped_gemm with different kbatch value.

* Add debug log information + remove unused code.

* Don't pass kbatch to CalculateKPadded.

* Turn on logging in grouped gemm and gemm splitk profiler

* Debug: limit number of test cases to run.

* Log more information and initialize with constant value.

* Turn on DEBUG_LOG

* Add more debug log information.

* Limit the number of instances to compile.

* Use GridwiseGemmPipeline

* Use KBatch to calculate K0

* Multiple DebugLog messages.

* Unit tests for multiple KBatch values.

* Refactoring

* Disable logging

* Extract KBatch update out of if statement.

* Uncomment instances.

* Disable DebugLog.

* Use KBatch when calculating KPadded.

* Fix CGridDesc padding.

* Use available helper functions.

* Uncomment code commented for debugging.

* Remove unnecessary debug log messages.

* Uncomment previously commented code for debug purposes.

* Add KBatch info to profiler output summary log.

* Add gtests for gemm splitk using ckProfiler API.

* Add more test-cases for different data layout.

* Add more test cases for gemm splitk

* Remove old test.

* Unit tests for MKNK ggemm interface.

* Fix and add more unit-tests.

* Constexpr everything!

* Increase error threshold for fp16 and splitk.

Since we're using fp16 atomic add for splitk, there's a known precision loss.

---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

70e4eb56
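
The commit above raises the error threshold because split-K partial results are added into an fp16 output. As a hedged host-side illustration of why that loses precision, and not a model of the actual kernel, the sketch below emulates fp16 rounding in plain C++ and compares rounding after every addition with rounding once at the end. All names are hypothetical, and the rounding helper assumes values in fp16's normal range (no subnormal, overflow, or NaN handling).

```cpp
#include <cmath>
#include <vector>

// Round a float to the nearest value with an 11-bit significand, as fp16 has.
// Assumption: x lies in fp16's normal range; subnormals, overflow, and
// NaN/Inf are not handled in this sketch.
inline float round_to_fp16_precision(float x)
{
    if(x == 0.0f)
        return x;
    int exponent         = 0;
    const float mantissa = std::frexp(x, &exponent); // |mantissa| in [0.5, 1)
    // Scale to an 11-bit integer and round (ties to even in the default mode).
    const float scaled = std::nearbyint(std::ldexp(mantissa, 11));
    return std::ldexp(scaled, exponent - 11);
}

// Models each K batch adding its partial result into an fp16 output:
// the running sum is rounded to fp16 precision after every addition.
inline float accumulate_rounding_each_step(const std::vector<float>& partials)
{
    float acc = 0.0f;
    for(float p : partials)
        acc = round_to_fp16_precision(acc + p);
    return acc;
}

// Reference: accumulate in fp32 and round to fp16 precision once at the end.
inline float accumulate_then_round(const std::vector<float>& partials)
{
    float acc = 0.0f;
    for(float p : partials)
        acc += p;
    return round_to_fp16_precision(acc);
}
```

For the partial sums `{2048, 1, 1, 1, 1}`, rounding after every addition stays at 2048, because 2049 is not representable with an 11-bit significand and the tie rounds back to 2048. Rounding once at the end gives 2052.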

Add instances for fp16/int8 Gemm kernels (Navi21) (#717) · c2d7a29d

Bartłomiej Kocot authored May 30, 2023

* Add instances for fp16/int8 Gemm kernels (Navi21)

* Extend instances with smaller tiles

* Fix SrcVectorTensor for km_kn_mn int8

c2d7a29d

24 May, 2023 2 commits

Clean-up the headers (#713) · ac9e01e2

Illia Silin authored May 24, 2023



* fix headers for gpu instances

* remove unused headers

---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>

ac9e01e2

Pool3d fwd (#697) · 76ec0089

rocking authored May 24, 2023

* Expand the base class of pool2d, prepare to share base class with pool3d

* Add pool3d device op

* Add pool3d f16 example

* Refactor the base class. implement generic pooling in the future

* clang format

* get original index in max pooling

* Add outputindex to base class

* Fix dimension

* Add pooling instance

* Use indexType instead

* Remove useless header

* Extract IndexDataType to template

* Extract pooling reference code

* clang format

* clang format

* Fix typo

* Add tensor stride

* Add missing header

* Add index stride and output stride

* Refine naming

* Add type to base class

* Rename file

* Use proper size

* Fix typo

* Refine naming

* Modify the argument into vector.

* Add max pool profiler

* Refine naming

* Support f32 pool

* Fix typo

* Add avg pool2d fwd in profiler

* clang format

* Rename AccDatatype to ComputeDatatype

* Fix init

* test pool

* Extract variable

* Add client example

* Check the pooling dim

* clang format

* Connect argv and arg_parser

* Add found check

* Remove useless header

* Refine naming

* Adjust the order of device_pool_fwd

76ec0089
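
Some bullets in the commit above mention returning the original index in max pooling. As a simplified sketch of that idea, reduced to one dimension and not taken from the repository, a max pool can record the input position of each selected maximum alongside its value. The type and function names are hypothetical, and padding and dilation are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: pooled values plus the input index of each maximum.
struct PoolResult
{
    std::vector<float> values;
    std::vector<std::int32_t> indices;
};

// 1D max pooling without padding or dilation (simplifying assumptions).
// For every window, stores the maximum and the position it came from in `in`.
inline PoolResult
max_pool_1d(const std::vector<float>& in, std::size_t window, std::size_t stride)
{
    PoolResult out;
    if(window == 0 || stride == 0 || in.size() < window)
        return out; // nothing to pool

    for(std::size_t start = 0; start + window <= in.size(); start += stride)
    {
        std::size_t best = start;
        for(std::size_t i = start + 1; i < start + window; ++i)
        {
            if(in[i] > in[best])
                best = i; // keep the first maximum on ties
        }
        out.values.push_back(in[best]);
        out.indices.push_back(static_cast<std::int32_t>(best));
    }
    return out;
}
```

For the input `{1, 5, 2, 4, 3, 9}` with a window of 2 and a stride of 2, the values are `{5, 4, 9}` and the indices are `{1, 3, 5}`.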