This folder contains an example of GEMM using the ck_tile tile-programming implementation. It currently supports only the basic features of CK Tile GEMM, but it sets up placeholders for future support of different GEMM pipelines and GEMM modules. In the near future, all GEMM features will be migrated gradually from old CK to CK Tile.
This folder contains an example of batched GEMM using the ck_tile tile-programming implementation.
## build
## build
```
```
...
@@ -8,24 +8,27 @@ This folder contains example for GEMM using ck_tile tile-programming implementat
mkdir build && cd build
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_gemm_basic -j
make tile_example_batched_gemm -j
```
```
This produces the executable `build/bin/tile_example_gemm_basic`.
This produces the executable `build/bin/tile_example_batched_gemm`.
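
The default batch strides in the batched GEMM argument list (32768, 16384, 32768) match the element counts of single densely packed A (m x k), B (k x n), and C (m x n) matrices at the default sizes m=256, n=128, k=128. The sketch below reproduces that arithmetic and prints a candidate command line. It rests on two assumptions not confirmed by this README: that dense packing is the intent behind the defaults, and that flags are passed as `-name=value`. Check the example's own help output before relying on either.

```shell
#!/bin/sh
# Default problem sizes taken from the batched GEMM argument list.
m=256
n=128
k=128
batch_count=16

# Assumption: each batch holds one densely packed matrix, so the batch
# stride equals the number of elements in a single matrix.
batch_stride_a=$((m * k))   # A is m x k
batch_stride_b=$((k * n))   # B is k x n
batch_stride_c=$((m * n))   # C is m x n

# Assumption: flags use the -name=value form; verify against the help output.
cmd="./bin/tile_example_batched_gemm -m=$m -n=$n -k=$k -batch_count=$batch_count"
cmd="$cmd -batch_stride_a=$batch_stride_a -batch_stride_b=$batch_stride_b -batch_stride_c=$batch_stride_c"

# Print the command instead of running it, so the sketch works without a GPU build.
echo "$cmd"
```

If you change `m`, `n`, or `k`, recomputing the batch strides this way keeps consecutive batches from overlapping, provided the matrices really are stored back to back.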
## example
## example
```
```
args:
args:
-b batch size (default:1)
-m m dimension (default:256)
-m m dimension (default:1024)
-n n dimension (default:128)
-n n dimension (default:2048)
-k k dimension (default:128)
-k k dimension (default:64)
-stride_a Tensor A stride (default:128)
-stride_a Tensor A stride (default:0)
-stride_b Tensor B stride (default:128)
-stride_b Tensor B stride (default:0)
-stride_c Tensor C stride (default:128)
-stride_c Tensor C stride (default:0)
-batch_stride_a Batch A stride (default:32768)
-v 0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:2)
-batch_stride_b Batch B stride (default:16384)
-e Absolute error tolerance (default:1e-5)
-batch_stride_c Batch C stride (default:32768)
-prec data type. fp16/bf16/fp8/bf8 (default:fp16)
-batch_count Batch count (default:16)
-warmup number of iterations before benchmark the kernel (default:10)
-v 0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:2)
-repeat number of iterations to benchmark the kernel (default:100)
-e Absolute error tolerance (default:1e-5)
-timer gpu:gpu timer, cpu:cpu timer (default:gpu)
-prec data type. fp16/bf16/fp8/bf8 (default:fp16)
-warmup number of iterations before benchmark the kernel (default:10)
-repeat number of iterations to benchmark the kernel (default:100)