We benchmark FBGEMM Grouped GEMM in both its Triton and CUDA (CUTLASS) versions, together with SGLang's Triton Grouped GEMM, to compare the memory bandwidth achieved by the different implementations.
The theoretical peak memory bandwidth of the H200 is 4.8 TB/s. Taking batch size 256 as an example, FBGEMM Triton Grouped GEMM FP8 reaches 3704.84 GB/s (77.2% of peak), FBGEMM CUTLASS F8F8BF16 Rowwise reaches 3042.63 GB/s (63.4% of peak), and SGLang Grouped GEMM FP8 reaches 2254.73 GB/s (47.0% of peak).
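As a sanity check on how such percentages are derived, the sketch below times an arbitrary grouped-GEMM callable and converts the measured latency into achieved bandwidth. The names `measure_bandwidth_gbps` and `run_gemm`, and the one-read/one-write traffic model, are illustrative assumptions, not FBGEMM or SGLang APIs.

```python
import torch
import triton.testing


def measure_bandwidth_gbps(run_gemm, a_list, b_list, out_dtype=torch.bfloat16):
    """Convert the measured latency of `run_gemm` into achieved bandwidth (GB/s).

    Traffic model (an assumption): each A_i and B_i is read once and each C_i
    is written once; FP8 operands are 1 byte, the BF16 output is 2 bytes.
    """
    ms = triton.testing.do_bench(run_gemm)  # latency in milliseconds
    out_bytes = torch.finfo(out_dtype).bits // 8
    total_bytes = 0
    for a, b in zip(a_list, b_list):  # one (M_i x K) @ (K x N) problem per group
        total_bytes += a.numel() * a.element_size()          # read A_i
        total_bytes += b.numel() * b.element_size()          # read B_i
        total_bytes += a.shape[0] * b.shape[1] * out_bytes   # write C_i
    return total_bytes / (ms * 1e-3) / 1e9


# Fraction of H200 peak: e.g. 3704.84 / 4800 ≈ 0.772, i.e. ~77.2%.
```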
warnings.warn("TMA load is disabled as there is no TMA descriptor support!")
ifUSE_TMA_STOREandnotHAS_TMA_DESC:
USE_TMA_STORE=False
warnings.warn("TMA store is disabled as there is no TMA descriptor support!")
# TODO(shikaili): Check the readniess of WS on ROCm side in Meta's Triton.
ifuse_warp_specializationandtorch.version.hip:
warnings.warn("Warp specialization is disabled as it is not supported on ROCm.")
use_warp_specialization=False
ifuse_warp_specializationandnot_HAS_WS_SUPPORT:
warnings.warn(
"Warp specialization is disabled as the Triton build in current environment doesn't have such support. Please build from https://github.com/facebookexperimental/triton/tree/ws-3.2.x to enable it for best performance on Nvidia's SM90 GPUs."