- 20 Jun, 2025 1 commit
Babak Hejazi authored
**Description**

Enable autotuning as an opt-in mode when benchmarking cuBLASLt via `cublaslt_gemm`.

The implementation is based on https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu

The behavior of the original benchmark command remains unchanged, e.g.:

- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3`

The new opt-in options are `-a` (enable autotune), `-I` (autotune iterations, default 50, same as the default for `-i`), and `-W` (autotune warmups, default 20, same as the default for `-w`), e.g.:

- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a`
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10`

**Note:** This PR also changes the default `gemm_compute_type` for BF16 and FP16 to `CUBLAS_COMPUTE_32F`.

**Further observations:**

1. The support matrix of `cublaslt_gemm` could be extended in the future to support non-FP16 output for FP8 inputs.
2. Currently, the input matrices are initialized with values of 1.0 and 2.0, which makes them less demanding in terms of power. Another future extension could add a fill mode for, say, uniform random numbers between -1 and 1.
3. cuBLAS workspace recommendations are listed under https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Update (June 10, 2025): verified using the higher-level test driver with these commands:

1. inline:

```
python3 -c "
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger

parameters = (
    '--num_warmup 10 --num_steps 50 '
    '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 '
    '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50'
)
context = BenchmarkRegistry.create_benchmark_context(
    'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters
)
benchmark = BenchmarkRegistry.launch_benchmark(context)
logger.info('Result: {}'.format(benchmark.result))
"
```

2. newly added script: `python3 examples/benchmarks/cublaslt_function.py`

---------

Co-authored-by: Babak Hejazi <babakh@nvidia.com>
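
For readers unfamiliar with what the `-a`, `-W`, and `-I` options control, the general idea of autotuning can be sketched as follows: each candidate is warmed up, then timed over a fixed number of iterations, and the fastest one is kept. This is a host-side Python illustration only, not the code in this PR. The function name `autotune` and the callable-based candidates are hypothetical; in the actual benchmark the candidates would be cuBLASLt algorithms, and timing would need device synchronization (for example via CUDA events). The defaults mirror the `-W 20` and `-I 50` defaults described above.

```python
import time
from typing import Callable, Dict


def autotune(
    candidates: Dict[str, Callable[[], None]],
    warmups: int = 20,      # mirrors the -W default described above
    iterations: int = 50,   # mirrors the -I default described above
) -> str:
    """Return the name of the candidate with the lowest mean time per call.

    Hypothetical helper for illustration; candidates are plain callables.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    if iterations <= 0:
        raise ValueError("iterations must be positive")

    best_name = None
    best_time = float("inf")
    for name, run in candidates.items():
        # Warmup runs are executed but not timed.
        for _ in range(warmups):
            run()
        # Timed runs: average wall-clock time per call.
        start = time.perf_counter()
        for _ in range(iterations):
            run()
        mean_time = (time.perf_counter() - start) / iterations
        if mean_time < best_time:
            best_name, best_time = name, mean_time
    return best_name
```

The selected candidate would then be used for the main measurement loop (the one governed by `-w` and `-i`), so autotuning cost stays separate from the reported timings.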