# FP8 Matmul Benchmark (8192×8192)

This document records the throughput achieved by `benchmark_matmul.py` when multiplying FP8 matrices sized `M = N = 8192` across a range of `K` dimensions. Each measurement uses the default autotuning search space bundled with the benchmark.

## Environment

- Repository commit: `6b1faf71faf18c564f5f77e0f5c1671cd91dfbc3`
- GPU: `NVIDIA H800 SXM` on driver `560.35.05`

## How to Reproduce

```bash
cd benchmark/matmul_fp8
python - <<'PY'
from benchmark_matmul import matmul

M = 8192
N = 8192
for K in [256, 512, 1024, 2048, 4096, 8192, 16384]:
    res = matmul(M, N, K, False)
    # res.latency is reported in milliseconds; convert to seconds
    # before dividing the FLOP count (2*M*N*K) to get TFLOPS.
    tflops = 2 * M * N * K / (res.latency * 1e-3) * 1e-12
    print(f"K={K:5d} latency={res.latency:.6f}ms TFLOPS={tflops:.3f}")
PY
```

## Results

| K     | Latency (ms) | Throughput (TFLOPS) |
|-------|--------------|---------------------|
| 256   | 0.060352     | 569                 |
| 512   | 0.080096     | 858                 |
| 1024  | 0.121696     | 1129                |
| 2048  | 0.204672     | 1343                |
| 4096  | 0.374816     | 1467                |
| 8192  | 0.729664     | 1507                |
| 16384 | 1.427264     | 1541                |
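As a sanity check, the throughput column can be recomputed directly from the tabulated latencies: an `M × N × K` matmul performs `2·M·N·K` floating-point operations (one multiply plus one add per output element per `K` step). The sketch below assumes the latencies are reported in milliseconds, which is the only unit consistent with H800-class FP8 throughput; it does not call the benchmark itself.

```python
# Recompute the throughput column of the results table from the latencies.
# Assumption: latencies are in milliseconds (consistent with the ~1.5 PFLOPS
# figures, which match H800-class FP8 Tensor Core performance).
M = N = 8192

measurements = [  # (K, latency_ms) pairs taken from the results table
    (256, 0.060352),
    (512, 0.080096),
    (1024, 0.121696),
    (2048, 0.204672),
    (4096, 0.374816),
    (8192, 0.729664),
    (16384, 1.427264),
]

for K, latency_ms in measurements:
    flops = 2 * M * N * K                          # total FLOPs for one matmul
    tflops = flops / (latency_ms * 1e-3) * 1e-12   # FLOPs per second, in TFLOPS
    print(f"K={K:5d}  {tflops:7.1f} TFLOPS")
```

Rounding each result to the nearest integer reproduces the table's throughput column, which also confirms the millisecond interpretation of the latency numbers.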