    Benchmark - Support autotuning in cublaslt gemm (#706) · 60b13256
    Babak Hejazi authored
    **Description**
    Enable autotuning as an opt-in mode when benchmarking cublasLt via
    `cublaslt_gemm`
    
    The implementation is based on
    https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu
    
    The behavior of the original benchmark command remains unchanged, e.g.:
    - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3`
    
    The new opt-in options are `-a` (enable autotune), `-I` (autotune
    iterations; default 50, matching the default for `-i`), and `-W` (autotune
    warmups; default 20, matching the default for `-w`), e.g.:
    - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3
    -a`
    - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a
    -I 10 -W 10`
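
    At its core, the autotune mode from the referenced cuBLASLt sample times each candidate algorithm returned by the heuristic query and keeps the fastest one. The selection principle can be sketched in plain Python (this is an illustrative sketch, not the actual C++ implementation; `run_algo` is a hypothetical callback standing in for a `cublasLtMatmul` launch with a given algo handle):

    ```python
    import time

    def autotune(algos, run_algo, warmups=20, iters=50):
        """Return the fastest algorithm and its average runtime.

        algos:    candidate algorithm handles (e.g. from a heuristic query)
        run_algo: callback that executes one GEMM with the given algo
        """
        best_algo, best_time = None, float("inf")
        for algo in algos:
            # Warm-up runs stabilize clocks and caches before timing.
            for _ in range(warmups):
                run_algo(algo)
            start = time.perf_counter()
            for _ in range(iters):
                run_algo(algo)
            elapsed = (time.perf_counter() - start) / iters
            if elapsed < best_time:
                best_algo, best_time = algo, elapsed
        return best_algo, best_time
    ```

    The `-W`/`-I` flags map onto the `warmups`/`iters` knobs above: more iterations give a more stable ranking at the cost of a longer tuning phase.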
    
    **Note:** This PR also changes the default `gemm_compute_type` for BF16
    and FP16 to `CUBLAS_COMPUTE_32F`.
    
    **Further observations:** 
    1. The support matrix of `cublaslt_gemm` could be extended in the future
    to support non-FP16 outputs for FP8 inputs as well.
    2. Currently, the input matrices are initialized with constant values of
    1.0 and 2.0, which makes the workload less demanding in terms of power
    draw. A future extension could add another fill mode, e.g. uniform
    random values between -1 and 1.
    3. cuBLAS workspace recommendations are listed under
    https://docs.nvidia.com/cuda/cublas/#cublassetworkspace
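
    The uniform fill mode suggested in observation 2 could look like the following pure-Python sketch (the function name and `mode` parameter are hypothetical, not part of the benchmark):

    ```python
    import random

    def fill_matrix(rows, cols, mode="constant", value=1.0, seed=0):
        """Hypothetical fill modes for benchmark input matrices."""
        if mode == "constant":
            # Current behavior: every element gets the same value.
            return [[value] * cols for _ in range(rows)]
        if mode == "uniform":
            # Uniform values in [-1, 1] toggle more sign/mantissa bits per
            # multiply than constant inputs, so the GEMM draws more power.
            rng = random.Random(seed)
            return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
                    for _ in range(rows)]
        raise ValueError(f"unknown fill mode: {mode}")
    ```

    A fixed seed keeps runs reproducible, so power and performance numbers remain comparable across invocations.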
    
    
    
    Update (June 10, 2025): verified using the higher-level test driver with
    these commands:
    
    1. inline:
    ```
    python3 -c "
    from superbench.benchmarks import BenchmarkRegistry, Platform
    from superbench.common.utils import logger
    
    parameters = (
        '--num_warmup 10 --num_steps 50 '
        '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 '
        '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50'
    )
    context = BenchmarkRegistry.create_benchmark_context(
        'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters
    )
    benchmark = BenchmarkRegistry.launch_benchmark(context)
    logger.info('Result: {}'.format(benchmark.result))
    "
    ```
    
    2. newly added script: 
    `python3 examples/benchmarks/cublaslt_function.py`
    
    ---------
    Co-authored-by: Babak Hejazi <babakh@nvidia.com>