- 24 Jun, 2025 1 commit
guoshzhao authored
**Description**

Add FP4 precision support for the cublaslt_gemm benchmark.

**Major Revision**

- Add the new type `fp4e2m1`, backed by `__nv_fp4_e2m1`.
- For FP4 matmul, the precision of matrix C (the add input) must be FP16 and the precision of matrix D (the output) must be FP4; otherwise the matmul will not work.
- Guard the new code with the `CUDA_VERSION` macro to resolve compatibility issues across different CUDA versions (a hedged sketch of this guard follows the entry).

---------

Co-authored-by: Ubuntu <aiperf@aiperf000000.hp5z1gqeinfufbj2u3jcty5fme.cdmx.internal.cloudapp.net>
Co-authored-by: AVA <39534996+avazr@users.noreply.github.com>
Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com>
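As a rough illustration of the version guard and the C/D precision rule described above, here is a minimal sketch. It assumes CUDA 12.8 as the first toolkit with FP4 support and assumes the `CUDA_R_4F_E2M1` enum name; the helper function and its layout choices are illustrative, not the benchmark's exact code.

```
#include <cuda.h>      // defines CUDA_VERSION
#include <cublasLt.h>

// FP4 types only exist in newer toolkits, so guard them to keep older
// CUDA versions compiling. The 12.8 (12080) threshold is an assumption.
#if CUDA_VERSION >= 12080
#include <cuda_fp4.h>  // __nv_fp4_e2m1

// Hypothetical helper: create layouts following the rule from the commit
// message -- A/B/D are FP4, while C (the add input) must be FP16.
void make_fp4_layouts(int64_t m, int64_t n, int64_t k,
                      cublasLtMatrixLayout_t* a, cublasLtMatrixLayout_t* b,
                      cublasLtMatrixLayout_t* c, cublasLtMatrixLayout_t* d) {
    cublasLtMatrixLayoutCreate(a, CUDA_R_4F_E2M1, m, k, m);
    cublasLtMatrixLayoutCreate(b, CUDA_R_4F_E2M1, k, n, k);
    cublasLtMatrixLayoutCreate(c, CUDA_R_16F, m, n, m);      // C: FP16
    cublasLtMatrixLayoutCreate(d, CUDA_R_4F_E2M1, m, n, m);  // D: FP4
}
#endif
```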
- 20 Jun, 2025 1 commit
Babak Hejazi authored
**Description**

Enable autotuning as an opt-in mode when benchmarking cublasLt via `cublaslt_gemm`. The implementation is based on https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu

The behavior of the original benchmark command remains unchanged, e.g.:

- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3`

The new opt-in options are `-a` (autotune), `-I` (autotune iterations; default 50, the same as the default for `-i`), and `-W` (autotune warmups; default 20, the same as the default for `-w`), e.g.:

- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a`
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10`

A hedged sketch of the query-and-time autotuning loop follows this entry.

**Note:** This PR also changes the default `gemm_compute_type` for BF16 and FP16 to `CUBLAS_COMPUTE_32F`.

**Further observations:**

1. The support matrix of `cublaslt_gemm` could be extended in the future to also support non-FP16 output for FP8 inputs.
2. Currently the input matrices are initialized with the values 1.0 and 2.0, which makes them less demanding in terms of power. Another future extension could be an additional fill mode, e.g. uniform random numbers between -1 and 1.
3. cuBLAS workspace recommendations are listed under https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Update (June 10, 2025): verified using the higher-level test driver with these commands:

1. Inline:

   ```
   python3 -c "
   from superbench.benchmarks import BenchmarkRegistry, Platform
   from superbench.common.utils import logger
   parameters = (
       '--num_warmup 10 --num_steps 50 '
       '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 '
       '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50'
   )
   context = BenchmarkRegistry.create_benchmark_context(
       'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters
   )
   benchmark = BenchmarkRegistry.launch_benchmark(context)
   logger.info('Result: {}'.format(benchmark.result))
   "
   ```

2. Newly added script: `python3 examples/benchmarks/cublaslt_function.py`

---------

Co-authored-by: Babak Hejazi <babakh@nvidia.com>
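Following the NVIDIA sample linked above, here is a minimal sketch of the opt-in autotuning loop: query cuBLASLt's heuristics for candidate algorithms, warm up and time each one with CUDA events, and keep the fastest. The function name and the caller-provided descriptors/buffers are assumptions for illustration, not the benchmark's exact code; `warmups`/`iters` correspond to the `-W`/`-I` options.

```
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <limits>

// Returns the index of the fastest heuristic candidate, or -1 if none.
// The chosen result can then be reused for the measured benchmark runs.
int autotune(cublasLtHandle_t handle, cublasLtMatmulDesc_t opDesc,
             cublasLtMatrixLayout_t Adesc, cublasLtMatrixLayout_t Bdesc,
             cublasLtMatrixLayout_t Cdesc, cublasLtMatrixLayout_t Ddesc,
             const void* alpha, const void* beta,
             const void* dA, const void* dB, const void* dC, void* dD,
             void* workspace, size_t workspaceSize, cudaStream_t stream,
             int warmups, int iters, cublasLtMatmulHeuristicResult_t* chosen) {
    // Cap candidate algorithms at the available workspace.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(pref,
        CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
        &workspaceSize, sizeof(workspaceSize));

    const int kRequested = 16;
    cublasLtMatmulHeuristicResult_t results[kRequested];
    int nFound = 0;
    cublasLtMatmulAlgoGetHeuristic(handle, opDesc, Adesc, Bdesc, Cdesc,
                                   Ddesc, pref, kRequested, results, &nFound);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int best = -1;
    float bestMs = std::numeric_limits<float>::max();
    for (int i = 0; i < nFound; ++i) {
        if (results[i].state != CUBLAS_STATUS_SUCCESS) continue;
        for (int w = 0; w < warmups; ++w)  // warm up this candidate
            cublasLtMatmul(handle, opDesc, alpha, dA, Adesc, dB, Bdesc,
                           beta, dC, Cdesc, dD, Ddesc, &results[i].algo,
                           workspace, workspaceSize, stream);
        cudaEventRecord(start, stream);
        for (int it = 0; it < iters; ++it)  // timed iterations
            cublasLtMatmul(handle, opDesc, alpha, dA, Adesc, dB, Bdesc,
                           beta, dC, Cdesc, dD, Ddesc, &results[i].algo,
                           workspace, workspaceSize, stream);
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; best = i; }
    }
    if (best >= 0) *chosen = results[best];

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasLtMatmulPreferenceDestroy(pref);
    return best;
}
```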
- 20 Nov, 2023 1 commit
Yuting Jiang authored
**Description**

Add int8 support for the cublaslt function.
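For reference, a hedged sketch of what an int8 cublasLt configuration typically looks like (hypothetical helper, not this commit's exact code): int8 inputs with int32 accumulation and int32 output.

```
#include <cublasLt.h>

// Hypothetical helper: the classic integer GEMM configuration --
// int8 A/B, int32 C, int32 accumulation and scale type.
void make_int8_gemm(int64_t m, int64_t n, int64_t k,
                    cublasLtMatmulDesc_t* op,
                    cublasLtMatrixLayout_t* a, cublasLtMatrixLayout_t* b,
                    cublasLtMatrixLayout_t* c) {
    cublasLtMatmulDescCreate(op, CUBLAS_COMPUTE_32I, CUDA_R_32I);
    cublasLtMatrixLayoutCreate(a, CUDA_R_8I, m, k, m);
    cublasLtMatrixLayoutCreate(b, CUDA_R_8I, k, n, k);
    cublasLtMatrixLayoutCreate(c, CUDA_R_32I, m, n, m);  // C doubles as D
}
```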
- 14 Apr, 2023 1 commit
Yifan Xiong authored
**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed Inference Benchmark (#505)
* Analyzer - Fix bug in Python 3.8 due to pandas API change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when writing host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)

Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
- 20 Mar, 2023 1 commit
Yifan Xiong authored
Support FP64/TF32/FP16/BF16 in cublaslt (batch) GEMM.
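For context, a sketch of one plausible mapping from these type names to cuBLASLt data/compute types; this is an assumption for illustration (the benchmark's actual defaults may differ, and a later commit in this log changes the FP16/BF16 compute-type default). TF32 is FP32 storage combined with the TF32 fast-math compute mode.

```
#include <string>
#include <cublasLt.h>

struct GemmTypes {
    cudaDataType_t data;          // element type of A/B/C/D
    cublasComputeType_t compute;  // accumulation mode
};

// Hypothetical mapping for the four supported input types.
GemmTypes pick_types(const std::string& t) {
    if (t == "fp64") return {CUDA_R_64F, CUBLAS_COMPUTE_64F};
    if (t == "tf32") return {CUDA_R_32F, CUBLAS_COMPUTE_32F_FAST_TF32};
    if (t == "fp16") return {CUDA_R_16F, CUBLAS_COMPUTE_32F};
    /* bf16 */       return {CUDA_R_16BF, CUBLAS_COMPUTE_32F};
}
```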
- 03 Jan, 2023 1 commit
Yifan Xiong authored
Add micro-benchmark for cublaslt fp8 gemm.
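Since FP8 is where a cublasLt GEMM starts to differ from the classic path, a brief hedged sketch of one FP8-specific detail (hypothetical helper, not necessarily this commit's code): FP8 matmuls support per-tensor scale factors, attached to the matmul descriptor as device pointers.

```
#include <cublasLt.h>

// Hypothetical helper: attach per-tensor scale factors for an FP8 matmul.
// dScaleA/dScaleB point to single float values in device memory.
void set_fp8_scales(cublasLtMatmulDesc_t op,
                    const float* dScaleA, const float* dScaleB) {
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_A_SCALE_POINTER,
                                   &dScaleA, sizeof(dScaleA));
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_B_SCALE_POINTER,
                                   &dScaleB, sizeof(dScaleB));
}
```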