- 20 Jun, 2025 1 commit
-
-
Babak Hejazi authored
**Description** Enable autotuning as an opt-in mode when benchmarking cublasLt via `cublaslt_gemm` The implementation is based on https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu The behavior of original benchmark command remains unchanged, e.g.: - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3` The new opt-in options are `-a` (for autotune) and `-I` (for autotune iterations, default is 50, same as the default for `-i`) and `-W` (for autotune warmups, default=20, same as the default for `-w`), e.g.: - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a` - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10` **Note:** This PR also changes the default `gemm_compute_type` for BF16 and FP16 to `CUBLAS_COMPUTE_32F`. **Further observations:** 1. The support matrix of the `cublaslt_gemm` could be further extended in the future to support non-FP16 output as well for FP8 inputs. 2. Currently, the input matrices are initialized with values of 1.0 and 2.0 which makes them less demanding in terms of power. Another future extension could be to enable another fill mode for, say, uniform random numbers between -1 and 1. 3. cuBLAS workspace recommendations are listed under https://docs.nvidia.com/cuda/cublas/#cublassetworkspace Update (June 10, 2025): verified using higher level test driver with these commands: 1. inline: ``` python3 -c " from superbench.benchmarks import BenchmarkRegistry, Platform from superbench.common.utils import logger parameters = ( '--num_warmup 10 --num_steps 50 ' '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 ' '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50' ) context = BenchmarkRegistry.create_benchmark_context( 'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters ) benchmark = BenchmarkRegistry.launch_benchmark(context) logger.info('Result: {}'.format(benchmark.result)) " ``` 2. newly added script: `python3 examples/benchmarks/cublaslt_function.py` --------- Co-authored-by:
Babak Hejazi <babakh@nvidia.com>
-
- 22 Nov, 2023 1 commit
-
-
Yuting Jiang authored
**Description** hipblaslt function benchmark and rebase cublaslt function benchmark.
-
- 20 Nov, 2023 1 commit
-
-
Yuting Jiang authored
**Description** add int8 support for cublaslt function.
-
- 22 Mar, 2023 1 commit
-
-
Yifan Xiong authored
Support batch and shape range with multiplication factors in cublaslt gemm benchmark.
-
- 20 Mar, 2023 1 commit
-
-
Yifan Xiong authored
Support FP64/TF32/FP16/BF16 in cublaslt (batch) GEMM.
-
- 03 Jan, 2023 1 commit
-
-
Yifan Xiong authored
Integrate cublaslt-gemm micro-benchmark #451.
-
- 01 Apr, 2022 1 commit
-
-
guoshzhao authored
**Description** Use config `log_raw_data` to control whether log the raw data into file or not. The default value is `no`. We can set it as `yes` for some particular benchmarks to save the raw data into file, such as NCCL/RCCL test.
-
- 08 Feb, 2022 1 commit
-
-
Ziyue Yang authored
This commit makes data checking in gpu_copy optional, because it will take too long time if message size is large.
-
- 07 Feb, 2022 1 commit
-
-
Ziyue Yang authored
**Description** This commit does the following to optimize result variance in gpu_copy benchmark: 1) Add warmup phase for gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead; 2) Use CUDA events for timing instead of CPU timestamps; 3) Make data checking an option that is not preferred to be enabled in performance test; 4) Enlarge message size in performance benchmark.
-
- 21 Jan, 2022 1 commit
-
-
Ziyue Yang authored
**Description** This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
-
- 09 Dec, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Unify metric names of benchmarks.
-
- 30 Oct, 2021 1 commit
-
-
Ziyue Yang authored
**Description** This commit does the following: 1) Adds CPU-initiated copy benchmark; 2) Adds dtod benchmark; 3) Support scanning NUMA nodes and GPUs inside the benchmark program; 4) Change the name of gpu-sm-copy to gpu-copy.
-
- 30 Aug, 2021 1 commit
-
-
Ziyue Yang authored
**Description** This commit adds gpu_sm_copy benchmark and related tests.
-
- 27 Aug, 2021 1 commit
-
-
guoshzhao authored
**Description** Rename `kernel_launch_overhead_event` to `event_overhead`, `kernel_launch_overhead_wall` to `wall_overhead`.
-
- 19 May, 2021 1 commit
-
-
guoshzhao authored
* add kernel launch overhead benchmark.
-