Commits · 0cb2e06ddcdf6b1414cbead4262b76b5d4391e93 · gaoqiong / composable_kernel_ROCM

26 Jun, 2024 1 commit

[CK_TILE] fmha forward split-kv + combine kernels (#1338) · 0cb2e06d

Po Yen Chen authored Jun 26, 2024



* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

* Add num_splits option and dummy split-kv api method

* Generate fmha_fwd_splitkv()

* Add SplitKV kernel codegen logics

* Add SplitKV combine kernel codegen logics

* Fix mismatched return type

* Clean-up code

* Replace sentinel value before storing

* Fix wrong layout of LSE/LSEacc/Oacc

* Format codes

* Fix o_acc memory error

* Fix wrong kBlockSize used in policy

* Reduce # of combine kernels

* Fix split-kv combine kernel name

* Fix wrong LDS indexing logics

* Fix wrong loop counter step logic

* Undo vector size changes

* Remove no-longer used field

* Remove inconsistent comment

* Remove debug statements in example

* Remove more debug statements

* Add constness to local variables

* Clean up generate.py

* Fix unstable clang-format comment

* Remove unused include directive

* Use shorter template parameter name

* Enable non-split-kv blobs

* Update license date

* Print num_splits conditionally

* Undo disabling data types

* Remove unnecessary tile size for fp8

* Fix wrong pipeline args for fp8

* Fix example output format

* Remove more debug code in combine pipeline

* Add stride kernel arguments for LSE/O acc workspace

* Re-order split-kv pipeline call operator arguments

* Pass LSE/O strides in kernel argument

* Re-order pipeline call operator arguments

* Use tensor_descriptor to locate LSEacc elements

* Support providing invalid element for tensor view

* Set invalid element value for LSEacc tensor view

* Remove hand-written store_tile() code

* Remove necessary value-overwrite logic

* Add transposed lds descriptor

* Support load_tile() for tile_window_with_static_lengths<>

* Undo removing necessary value-overwrite logic

* Use read descriptor to locate lds elements

* Simplify pipeline source code

* Add constraint to kMaxSplits

* Default use kMaxSplits=64 in generate.py

* Revert "Add constraint to kMaxSplits"

This reverts commit 0a2132d758042e6fb0292f4e354909b8a4d1c118.

* Revert "Default use kMaxSplits=64 in generate.py"

This reverts commit c7d9c80b77320aec6559222bed7d47adcaefe4e3.

* Decide alignment by the padding parameter

* Remove no-longer used utility functions

* Remove not-working code

* Add comment & remove no-longer used code

* Fix computation errors

* Add heuristic to override num_splits option

* Add constraint to kMaxSplits

* Fix compilation error

* Clean up pipeline code

* Wrap pointer access as lambda function

* Rename confusing methods

* Use kLogMaxSplits as template parameter

* Finish splitkv combine kernel codegen

* Update kMaxSplits limit

* Use smaller kM0 for splitkv combine kernel

* Ignore dropout flag in splitkv pipeline

* Unify flag usage

* Add back flag kStoreLSE

* Merge lambda calls in pipeline

* Fix compilation errors

* Avoid all empty splits

* Always check for empty loop in splitkv pipelines

* Re-order parameters

* Remove redundant p_drop option check

* Add traits/problem for fwd splitkv kernel

* Conditionally enable uneven split boundary checks

* Add comment for the splitkv traits field

* Change even split criteria

* Re-order statements

* Refine occupancy value for hdim=128&256

* Refine occupancy value for hdim=32&64

* Remove redundant kernel argument

* Separate fmha bwd codegen logics

* Separate fmha fwd codegen logics

* Remove redundant direction parameter in fwd&bwd codegen logics

* Support generate multiple APIs for an example

* Let 'api' an alias of 'direction' option

* Remove choices for the 'direction' option

* Use dictionary to config all the functions

* Move fmha splitkv codegen logics to other file

* Add fwd_splitkv api for tile_example_fmha_fwd

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

0cb2e06d
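
For orientation, the split-kv idea named in #1338 can be sketched in a few lines. This is a minimal pure-Python illustration of the general technique, not the CK kernels: the key/value sequence is cut into chunks, each chunk yields a normalized partial output plus a log-sum-exp (LSE) value, and a combine step re-weights the partial outputs by their LSEs. The function names `partial_attention` and `combine_splits` are hypothetical, and the arguments `o_acc` / `lse_acc` only echo the Oacc / LSEacc workspaces mentioned in the bullets above. Softmax scaling, masking, bias and dropout are left out.

```python
import math


def partial_attention(q, keys, values, head_dim_v):
    """Attention of one query over one KV chunk.

    Returns (output normalized within the chunk, log-sum-exp of the scores).
    An empty chunk returns a zero output and an LSE of -inf, so that it
    receives zero weight in the combine step.
    """
    if not keys:
        return [0.0] * head_dim_v, -math.inf
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(head_dim_v)
    ]
    return out, m + math.log(denom)


def combine_splits(o_acc, lse_acc):
    """Merge per-split outputs using their LSE values.

    o_acc:   list of per-split output vectors
    lse_acc: list of per-split log-sum-exp values
    Returns (combined output, combined LSE).
    """
    dim = len(o_acc[0])
    lse_max = max(lse_acc)
    if lse_max == -math.inf:
        # Every split was empty: nothing to attend to.
        return [0.0] * dim, -math.inf
    # exp(lse_i - lse_max) is proportional to each split's softmax denominator.
    scales = [math.exp(lse - lse_max) for lse in lse_acc]
    total = sum(scales)
    out = [sum(s * o[d] for s, o in zip(scales, o_acc)) / total for d in range(dim)]
    return out, lse_max + math.log(total)
```

Combining the per-chunk results this way should reproduce, up to floating-point rounding, what a single pass over the whole key/value sequence would give, which is the property a split-kv + combine pair relies on.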

25 Jun, 2024 1 commit

CK Instance Gen (#1145) · 3e9711f0

arai713 authored Jun 25, 2024



* Format

* Format

* Format

* Remove const

* Use the right template

* Format

* Format

* add row/col instances

* Add missing file

* fixed

* fixing block to etile error

* Format

* Updates

* Format

* fixed rrr layout

* generating a sample JSON file: currently contains includes, prologue/epilogue and instances

* version where the json is passed into the instances to generate a key

* updated run function to just launch kernel

* updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer

* adding in testing files

* cleaned up comments, still need to work on including header files

* removed unneeded files

* removed/commented out JSON implementation

* added fusion(prologue/epilogue) into instance generation

* working on instance selection

* added instance selection, need to fix instance validation

* removed block2etile map validity check for testing purposes

* test running: failing due to incorrect files/input

* all grid descs/ptrs completed, but device file not found

* Update test and embed modules

* Restore older version

* added convolution operation, written test, debugging generated code for compilation

* attempting to include CK in host directory: _Float16 error

* CK header file issues

* slight fix

* don't crash when hip can't report total memory

* dump generated code to a file

* changing sizes

* creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed

* some fixes to call the device code

* separating test files for conv and gemm

* completed arg ptr, now have linking errors

* clang format fix

* resolved linker issues in conv test

* remove dependency on libutility from ck

* resolved num dim error

* properly passing arg ptr, errors with passing typenames: redefinition/redeclaration

* undo the commenting of device function

* hand created kernel code to find rtc issues

* dump the full src to file

* resolved redeclaration errors, cleaned up errors for Amber's kernel code

* debugging purposes: redeclaration error

* config files

* resolved errors for NumTensor and redeclaration, formatted version.h

* resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type

* WIP: close to getting kernel compiled

* WIP: fixing rtc errors

* fixed sequence errors, formatting, still one error with run fcn

* yay: kernel compiles and runs

* updated templated/generated version to run and compile

* minor fixes

* working generated example, resolved memory access error due to padding

* adding in reference kernel, validation failing against reference

* debugging: printing kernel args

* reduced error in results

* debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues

* working validation (using reference convolution) with prologue function for both hard-coded and generated version

* WIP: create an alt version that creates Argument on the device

* wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments

* wip: making necessary methods device code

* added grid descs, working on grid pointers, errors with stl numerics

* wip: updating kernel args - issue, replacing some std functions

* replaced std::accumulate call with temp hardcoded version

* wip: args causing memory issue

* Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes

* adding object file dump

* temporary hardcoding of grid size, can remove device op inst + arg ptr

* minor fix for grid size

* added modified example where arg ptr is created on the device for generated version as well

* removed device op instance and arg ptr from modified examples

* moving device op file for testing purposes and to properly build CK

* commenting out print-outs

* adjust compiler args to produce a valid ELF file

* temporary removal of validation

* reverting compiler args back for working example

* retrieve necessary arguments from generated template parameters in correct format

* calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly

* scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example

* temporary change to generate ELF format binary object file

* removed unnecessary code, added comments

* formatting fix

* cleaned up code, added new tests, restructured library: move helper into CK

* refactored launch parameter calculation to be more concise

* renamed files and variables for more clarity/uniformity

* more code cleaning, removed debug statements

* moved majority of my files into codegen directory, running properly

* updated Embed.cmake(string_view) in codegen directory

* updated host directory to match Embed.cmake as well

* added old tests in

* updated instance generation methods to be more concise

* removed layout from launch parameter calculation

* working test

* fixed issue with verification, all instances working

* updated verification in other tests

* removed duplicate matrix padder file, removed code dumps

* removed old hard-coded tests

* removed old host directory, all files in codegen directory now

* fixed copyright in files

* commenting out validation

* renamed files

* made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code

* updated headers

* removing duplicate file for fwd conv to gemm, merging with original file

* fix building codegen with clang++ directly

* resolving build error from conv_fwd_to_gemm

* fix for previous error

* renaming tests

* created common test file

* cleaned up code, added comments

* renamed device op

* fixed typos in comments

* removed extra space

* code cleanup: resolving Amber's comments

* removed wrapper struct for matrix padder, fixed template

* cleaned up if statements for better readability

---------
Co-authored-by: Paul <pfultz2@yahoo.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

3e9711f0

24 Jun, 2024 1 commit

layernorm2d forward (#1339) · cb138394

rocking authored Jun 24, 2024



* Add layernorm2d forward

* Refine file path

* clang format

* Exclude ck_tile op from all

* use add_executable instead

* refactor layernorm2d_fwd example

---------
Co-authored-by: carlushuang <carlus.huang@amd.com>

cb138394

22 Jun, 2024 1 commit

Add instances of grouped convolution 3d forward with a ConvScale element-wise... · 05b10e0e

Andriy Roshchenko authored Jun 21, 2024


Add instances of grouped convolution 3d forward with a ConvScale element-wise op for bf8@bf8->fp8 (#1326)

We are adding more instances of grouped convolution 3d forward with a ConvScale element-wise operation.
This commit handles bf8@bf8->fp8 data types combination.

* Included an example.
* Added instances.
* Added a client example.

---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

05b10e0e

21 Jun, 2024 2 commits

WA for rocm-6.2+ s constraint for buffer resource (#1346) · fa129c1a
carlushuang authored Jun 22, 2024
```
* WA for rocm-6.2+ s constraint for buffer resource

* add missing memory clobber
```
fa129c1a

Fix cmake warnings (#1342) · 510325a4

Bartłomiej Kocot authored Jun 21, 2024

* Cmake add -Wno-nvcc-compt

* Remove template without initialization list

* dpp remove template without init list

* Fixes

510325a4

20 Jun, 2024 3 commits
- Fix FA bwd alibi+causal NaN errors (#1352) · 1da802bd
  Dan Yao authored Jun 20, 2024
```
* fix bwd alibi nan error

* fix datatype

---------

Co-authored-by: danyao12 <danyao12>
```
  1da802bd
- Adding Missed Activation Functions for Grouped 2D/3D Convolutions (#1348) · 0162a5f6
  ThruptiRajLakshmanaGowda authored Jun 20, 2024
```
* Initial Push

* First Push

* Fixed Clang format

* Resolve merge conflict

* Addressed review comments

* Addressed review comments

* Addressed review comments
```
  0162a5f6
- Fix in dropout lambda to avoid the compiling issue on some docker/compiler envs (#1350) · e3f44659
  Qianfeng authored Jun 20, 2024
  
  e3f44659
19 Jun, 2024 2 commits
- Remove gfx900 and gfx906 from default target device to reduce package size (#1351) · 8db331a5
  zjing14 authored Jun 19, 2024
  
  8db331a5
- Hacking ck_tile fmha Dropout facility (#1344) · 1973903f
  Qianfeng authored Jun 19, 2024
```
* Add NullBlockDropout to be used when kHasDropout is false

* Change to BlockDropout::Run() for forward to reduce conditional checkings

* Re-format files

---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
```
  1973903f
18 Jun, 2024 3 commits
- Add read_first_lane function for int64 (#1347) · 8faec23c
  Bartłomiej Kocot authored Jun 18, 2024
  
  8faec23c
- Switch to universal gemm in grouped gemm tile loop (#1335) · e2d13920
  jakpiase authored Jun 18, 2024
```
* switch to universal gemm in grouped gemm tile loop

* minor fixes

* add reviewers comments

---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
```
  e2d13920
- Fix continuous dim selection in contraction (#1336) · 933951ed
  Bartłomiej Kocot authored Jun 18, 2024
```
* Fix continuous dim selection in contraction

* Fixes
```
  933951ed
17 Jun, 2024 2 commits
- [CK_TILE][FA] using pk f16_f32 (#1343) · 17ed368f
  carlushuang authored Jun 17, 2024
```
* [CK_TILE][FA] using pk f16_f32

* correct an error
```
  17ed368f
- disabled lds direct load inline asm (#1331) · e0210316
  zjing14 authored Jun 16, 2024
  
  e0210316
14 Jun, 2024 1 commit
- Support large tensors in grouped conv fwd (#1332) · dc1e9c5d
  Bartłomiej Kocot authored Jun 14, 2024
```
* Support large tensors in grouped conv fwd

* Multi ABD fixes

* Fix calculate element space size
```
  dc1e9c5d
13 Jun, 2024 1 commit

Fix to the using of static_for in amd_buffer_addressing.hpp (#1337) · 37a347e3

Qianfeng authored Jun 13, 2024



* Add insert_dummy_dep_per_dword over-loading for length 64

* Fix insert_dummy_dep_per_dword and remove over-loading for length 64

* Remove blank lines

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

37a347e3

12 Jun, 2024 1 commit
- Add instances for grouped conv fwd 3d with ConvScale for fp8@bf8->fp8 (#1325) · acda4c5a
  Rostyslav Geyyer authored Jun 12, 2024
```
* Add fp8 bf8 conv example

* Add instances

* Add client example

* Add random scale values

* Format
```
  acda4c5a
11 Jun, 2024 1 commit
- Fix nhwgc f16 wmma instances (#1328) · 5fc1bee4
  Bartłomiej Kocot authored Jun 11, 2024
  
  5fc1bee4
10 Jun, 2024 1 commit

Add a convinvscale op, related instances and examples (#1307) · ce66277a

Rostyslav Geyyer authored Jun 10, 2024



* Update the element op

* Add an example

* Add instances

* Add a client example

* make sure new instances only build on gfx9

* Update element op and its handling

* Format

* Update instances to take element op as an argument

* Update examples to use random scale values

* Format

* Update client example with random scales

* Format

---------
Co-authored-by: illsilin <Illia.Silin@amd.com>

ce66277a

07 Jun, 2024 1 commit

Bump rocm-docs-core from 1.3.0 to 1.4.0 in /docs/sphinx (#1327) · 8f5690c4

dependabot[bot] authored Jun 06, 2024

Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.3.0 to 1.4.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.3.0...v1.4.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

8f5690c4

05 Jun, 2024 3 commits

Integrate universal gemm with conv forward (#1320) · ac58cc5d

Bartłomiej Kocot authored Jun 05, 2024

* Integrate universal gemm with conv fwd

* Fix conv fwd wmma test

* Fix instances

* Remove direct load check

ac58cc5d

Bump rocm-docs-core from 1.2.1 to 1.3.0 in /docs/sphinx (#1324) · ba82beb9

dependabot[bot] authored Jun 05, 2024

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 1.2.1 to 1.3.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v1.2.1...v1.3.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

ba82beb9

Add a scale op, related instances and examples (#1242) · cb0645be

Rostyslav Geyyer authored Jun 04, 2024



* Add a scale op

* Update the element op

* Add instances

* Add an example

* Add a client example

* Add a flag check

* Revert flag check addition

* Fix flag check

* Update d strides in example

* Update d strides in client example

* Apply suggestions from code review

Update copyright header
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Move the example

* Move the client example

* Update element op

* Update example with the new element op

* Add scalar layout

* Update example

* Update kernel for scalar Ds

* Revert kernel changes

* Update element op

* Update example to use scales' pointers

* Format

* Update instances

* Update client example

* Move element op to unary elements

* Update element op to work with values instead of pointers

* Update instances to take element op as an argument

* Update examples to use random scale values

---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

cb0645be

04 Jun, 2024 2 commits

CK Tile FA Training kernels (#1286) · 2cab8d39

Dan Yao authored Jun 05, 2024



* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

2cab8d39

Bump rocm-docs-core from 1.2.0 to 1.2.1 in /docs/sphinx (#1322) · 76827d82

dependabot[bot] authored Jun 03, 2024

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 1.2.0 to 1.2.1.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v1.2.0...v1.2.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

76827d82

03 Jun, 2024 1 commit
- disable the hipTensor test by default, only run once daily (#1321) · 3fa7e2a6
  Illia Silin authored Jun 03, 2024
  
  3fa7e2a6
01 Jun, 2024 1 commit

Post-merge fix of PR 1300 (#1313) · 6fb1f4e0

zjing14 authored Jun 01, 2024

* add f8 gemm with multiD for both row/col wise

* change compute_type to fp8

* changed tuning parameters in the example

* add rcr example

* post-merge fix

* fix

* reduce init range

6fb1f4e0

28 May, 2024 4 commits

Build CK library for all supported targets. (#1312) · 34f3dfdd

Illia Silin authored May 28, 2024

* test library build for all supported targets

* increase the number of threads to build lib in CI to 64

34f3dfdd

Bump rocm-docs-core from 1.1.3 to 1.2.0 in /docs/sphinx (#1311) · 66de8a02

dependabot[bot] authored May 28, 2024

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 1.1.3 to 1.2.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v1.1.3...v1.2.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

66de8a02

add f8 gemm multiD with both row/col wise scale (#1300) · 80db62f0

zjing14 authored May 28, 2024

* add f8 gemm with multiD for both row/col wise

* change compute_type to fp8

* changed tuning parameters in the example

* add rcr example

80db62f0

[CK_TILE] support group from cmdline (#1295) · 5055b3bd

carlushuang authored May 28, 2024

* support cmdline seqlen decode

* silent print

* update readme

* update kernel launch 3d

* update tile partitioner

* fix spill for bf16

* modify based on comment

* modify payload_t

* fix bug for alibi mode

* fix alibi test err

* refactor kernel launch, support select timer

* add missing file

* remove useless code

* add some comments

5055b3bd

23 May, 2024 3 commits

Enable external CI pipeline triggers (#1310) · 02fa2c29
Joseph Macaranas authored May 23, 2024

02fa2c29
Split the gemm_multi_abd instances. (#1306) · ec2bae27
Illia Silin authored May 23, 2024
```
* split the gemm_multi_abd instances

* update the dates
```
ec2bae27

Bump rocm-docs-core from 1.1.2 to 1.1.3 in /docs/sphinx (#1308) · 06a9b72c

dependabot[bot] authored May 23, 2024

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 1.1.2 to 1.1.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v1.1.2...v1.1.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

06a9b72c

22 May, 2024 3 commits

Make the library which generates CK instances for pytorch2 inductor's CK backend usage · 29e58d5b

Max Podkorytov authored May 21, 2024

Also bundle the CK library and include files with the pip package.

The package is pip-installable with
`pip install git+https://github.com/tenpercent/composable_kernel@enable-pip`

(substitute the repo path and branch if necessary)

Testing:

`myenv/bin/python3 -m ck4inductor.universal_gemm.gen_instances`

(prints a list of instances)

`tree myenv/lib/python3.12/site-packages/ck4inductor`

(observe the list of sources along the installed package)

29e58d5b

Optimize grouped conv bwd weight for small M and N (#1303) · fd72380a
Bartłomiej Kocot authored May 22, 2024
```
* Optimize grouped conv bwd weight for small M and N

* Fixes
```
fd72380a

Select appropriate GPU targets for instances, tests, and examples. (#1304) · 7b027d56

Illia Silin authored May 22, 2024

* set individual gpu targets for instances, examples, tests

* fix path to hip compiler

* fix path to hip compiler once more

* aggregate device macros in ck_tile config header

* fix the cmake logic for instances

* fix clang format

* add gfx900 and gfx906 to default set of targets

7b027d56

21 May, 2024 1 commit
- Move grouped conv fwd client examples (#1299) · 204da9c5
  Rostyslav Geyyer authored May 21, 2024
```
* Move grouped conv fwd client examples

* Update existing examples

* Format
```
  204da9c5