OpenDAS / tilelang

[Benchmark] Add matmul FP16 benchmark results (#1067)

Commit b2acfc37, authored Oct 19, 2025 by Lei Wang, committed by GitHub on Oct 19, 2025
Parent: 17bd0a6c
Showing 4 changed files with 49 additions and 8 deletions:

- benchmark/matmul/README.md (+36 -0)
- benchmark/matmul/benchmark_matmul.py (+3 -0)
- benchmark/matmul_fp8/README.md (+7 -7)
- benchmark/matmul_fp8/benchmark_matmul.py (+3 -1)

benchmark/matmul/README.md (new file, 0 → 100644)

# FP16 Matmul Benchmark (8192×8192)

This document records the throughput achieved by `benchmark_matmul.py` when multiplying FP16 matrices sized `M = N = 8192` across different `K` dimensions using the default autotuning search space.

## Environment

- Repository commit: `17bd0a6c651f599bec1397e0b91830c3ddc93076`
- GPUs: `NVIDIA H800 SXM` on driver `560.35.05`

## How to Reproduce

```bash
cd benchmark/matmul
python - <<'PY'
from benchmark_matmul import matmul

M = 8192
N = 8192
for K in [256, 512, 1024, 2048, 4096, 8192, 16384]:
    res = matmul(M, N, K, False)
    # res.latency is treated as milliseconds here: that is the unit under
    # which the figures in the Results table are self-consistent.
    tflops = 2 * M * N * K / (res.latency * 1e-3) * 1e-12
    print(f"K={K:5d} latency={res.latency:.6f}ms TFlops={tflops:.3f}")
PY
```

## Results
| K     | Latency (ms) | Throughput (TFLOPs) |
|-------|--------------|---------------------|
| 256   | 0.089056     | 386                 |
| 512   | 0.132064     | 520                 |
| 1024  | 0.218816     | 628                 |
| 2048  | 0.390112     | 705                 |
| 4096  | 0.746752     | 736                 |
| 8192  | 1.449888     | 758                 |
| 16384 | 2.871168     | 766                 |
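
The throughput column can be cross-checked from the latency column alone, since an `M×K` by `K×N` matmul performs `2·M·N·K` floating-point operations. The sketch below assumes the recorded latencies are in milliseconds, which is the only unit under which the table's figures agree with each other; `tflops_from_latency_ms` is a hypothetical helper written for this check, not a function from `benchmark_matmul.py`.

```python
def tflops_from_latency_ms(m: int, n: int, k: int, latency_ms: float) -> float:
    """Throughput in TFLOPs for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * n * k          # one multiply and one add per accumulated product
    seconds = latency_ms * 1e-3    # assumption: the latency is reported in milliseconds
    return flops / seconds * 1e-12


# (K, latency) pairs copied from the results table above
rows = [
    (256, 0.089056),
    (512, 0.132064),
    (1024, 0.218816),
    (2048, 0.390112),
    (4096, 0.746752),
    (8192, 1.449888),
    (16384, 2.871168),
]

for k, latency in rows:
    print(f"K={k:5d}  {tflops_from_latency_ms(8192, 8192, k, latency):.0f} TFLOPs")
```

Rounded to whole numbers, each line reproduces the corresponding throughput entry in the table.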

benchmark/matmul/benchmark_matmul.py

    ...
    @@ -2,6 +2,7 @@ import argparse
    import itertools
    import logging
    import tilelang
    import tilelang.language as T
    from tilelang.autotuner import autotune
    from tilelang import jit
    ...
    @@ -187,6 +188,8 @@ def matmul(
    # Enable (or disable) swizzling optimization
    T.use_swizzle(panel_size=10, enable=enable_rasteration)
    # to utilize swizzle tma layout
    T.annotate_layout({C_shared: tilelang.layout.make_swizzled_layout(C_shared)})
    # Clear out the accumulation buffer
    T.clear(C_local)
    ...
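
For readers unfamiliar with why a swizzled shared-memory layout can help, here is a generic illustration of XOR swizzling. It is not the mapping that `tilelang.layout.make_swizzled_layout` produces (the diff does not show that mapping); the bank count, tile size, and function names are all assumptions chosen for the example. It counts how many rows of one tile column fall into the same bank, with and without the swizzle.

```python
from collections import Counter

NUM_BANKS = 32   # assumption: 32 shared-memory banks, each one 4-byte word wide
TILE = 32        # assumption: a 32x32 tile of 4-byte words, stored row-major


def bank_linear(row: int, col: int) -> int:
    """Bank hit by element (row, col) in a plain row-major tile."""
    return (row * TILE + col) % NUM_BANKS


def bank_swizzled(row: int, col: int) -> int:
    """Bank hit when the column index is XOR-ed with the row before storing."""
    return (row * TILE + (col ^ row)) % NUM_BANKS


def worst_conflict(bank_fn, col: int) -> int:
    """Largest number of rows in one column that map to the same bank."""
    hits = Counter(bank_fn(row, col) for row in range(TILE))
    return max(hits.values())


print(worst_conflict(bank_linear, 0))    # 32: every row of the column shares one bank
print(worst_conflict(bank_swizzled, 0))  # 1: each row of the column lands in its own bank
```

In the plain layout a column access serializes on a single bank, while the XOR permutes each row differently so the same access spreads over all banks.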

benchmark/matmul_fp8/README.md

    ...
    @@ -27,10 +27,10 @@ PY
     | K     | Latency (s) | Throughput (TFLOPs) |
     |-------|-------------|---------------------|
    -| 256   | 0.091488    | 376                 |
    -| 512   | 0.110496    | 622                 |
    -| 1024  | 0.148256    | 927                 |
    -| 2048  | 0.234080    | 1174                |
    -| 4096  | 0.398944    | 1378                |
    -| 8192  | 0.752416    | 1461                |
    -| 16384 | 1.443808    | 1523                |
    +| 256   | 0.060352    | 569                 |
    +| 512   | 0.080096    | 858                 |
    +| 1024  | 0.121696    | 1129                |
    +| 2048  | 0.204672    | 1343                |
    +| 4096  | 0.374816    | 1467                |
    +| 8192  | 0.729664    | 1507                |
    +| 16384 | 1.427264    | 1541                |
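
To see how much the FP8 numbers moved, the two sets of rows above can be compared directly. This sketch assumes the first set of rows holds the values the commit replaces and the second set the new ones, as the ordering of the diff suggests; `speedup` is a hypothetical helper for this comparison.

```python
# K -> (latency before, latency after), copied from the FP8 rows above
fp8_latency = {
    256: (0.091488, 0.060352),
    512: (0.110496, 0.080096),
    1024: (0.148256, 0.121696),
    2048: (0.234080, 0.204672),
    4096: (0.398944, 0.374816),
    8192: (0.752416, 0.729664),
    16384: (1.443808, 1.427264),
}


def speedup(before: float, after: float) -> float:
    """Ratio of old latency to new latency; values above 1 mean faster."""
    return before / after


for k, (before, after) in fp8_latency.items():
    print(f"K={k:5d}  speedup={speedup(before, after):.2f}x")
```

The gain is largest at small `K` (roughly 1.5x at `K = 256`) and shrinks to about 1% at `K = 16384`.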

benchmark/matmul_fp8/benchmark_matmul.py

    import argparse
    import itertools
    import logging
    import tilelang
    import tilelang.language as T
    from tilelang.autotuner import autotune
    from tilelang import jit
    ...
    @@ -190,6 +190,8 @@ def matmul(
    # Enable (or disable) swizzling optimization
    T.use_swizzle(panel_size=10, enable=enable_rasteration)
    # to utilize swizzle tma layout
    T.annotate_layout({C_shared: tilelang.layout.make_swizzled_layout(C_shared)})
    # Clear out the accumulation buffer
    T.clear(C_local)
    ...