Commits · f6e65a98305af041ae84e3906f26c310b69f8d8d · tsoc / superbenchmark

23 Oct, 2025 1 commit

Benchmarks: Micro benchmark - add ncu profile support in cublaslt-gemm (#740) · f6e65a98

Yuting Jiang authored Oct 23, 2025

**Description**
This PR adds NCU (NVIDIA Nsight Compute) profiling support to the
cublaslt-gemm micro benchmark, enabling detailed kernel analysis
including DRAM throughput, compute throughput, and launch arguments.

**Major Revision**
- Add `--enable_ncu_profiling` and `--profiling_metrics` for NCU profiling
- Modify command execution to use NCU when profiling is enabled
- Update result parsing to handle both standard and NCU-profiled output
formats
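
As an illustration of the command-execution change, the following is a minimal sketch, not the PR's actual code, of how a benchmark command might be wrapped with `ncu` when profiling is enabled. The helper name `build_command` and its parameters are hypothetical; the only external fact relied on is that `ncu --metrics` accepts a comma-separated list of metric names.

```python
import shlex


def build_command(base_cmd, enable_ncu_profiling=False, profiling_metrics=None):
    """Return the command line to run (hypothetical helper, for illustration).

    base_cmd is the plain benchmark command line. When profiling is enabled,
    it is prefixed with `ncu` and, if given, a `--metrics` list.
    """
    if not enable_ncu_profiling:
        # Standard path: run the benchmark unchanged.
        return base_cmd
    ncu_prefix = ['ncu']
    if profiling_metrics:
        # `ncu --metrics` takes comma-separated metric names.
        ncu_prefix += ['--metrics', ','.join(profiling_metrics)]
    return shlex.join(ncu_prefix) + ' ' + base_cmd


if __name__ == '__main__':
    print(build_command(
        'cublaslt_gemm -m 2048 -n 12288 -k 1536',
        enable_ncu_profiling=True,
        profiling_metrics=['dram__throughput.avg.pct_of_peak_sustained_elapsed'],
    ))
```

The result parser would then need to recognize which of the two output formats it received; that part is not sketched here because the output layout is not described in the commit message.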

f6e65a98

24 Jun, 2025 1 commit

Benchmarks - Add FP4 GEMM FLOPS support for cublaslt_gemm benchmark (#711) · b795477e

guoshzhao authored Jun 24, 2025



**Description**
Add FP4 precision support for cublaslt_gemm benchmark.

**Major Revision**
- Add new type `fp4e2m1` and `__nv_fp4_e2m1`.
- For FP4 matmul, the precision of MatrixC (add) must be FP16 and the
precision of MatrixD (output) must be FP4; otherwise, it will not work.
- Add macro `CUDA_VERSION` to resolve the compatibility issue of
different CUDA versions.

---------
Co-authored-by: Ubuntu <aiperf@aiperf000000.hp5z1gqeinfufbj2u3jcty5fme.cdmx.internal.cloudapp.net>
Co-authored-by: AVA <39534996+avazr@users.noreply.github.com>
Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com>

b795477e

20 Jun, 2025 1 commit

Benchmark - Support autotuning in cublaslt gemm (#706) · 60b13256

Babak Hejazi authored Jun 20, 2025

**Description**
Enable autotuning as an opt-in mode when benchmarking cuBLASLt via
`cublaslt_gemm`.

The implementation is based on
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu

The behavior of the original benchmark command remains unchanged, e.g.:
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3`

The new opt-in options are `-a` (autotune), `-I` (autotune iterations,
default 50, same as the default for `-i`), and `-W` (autotune warmups,
default 20, same as the default for `-w`), e.g.:
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a`
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10`
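
The autotuning idea can be sketched as follows: run each candidate for a number of warmups, time a number of iterations, and keep the fastest. This is a CPU-side illustration under stated assumptions, not the benchmark's implementation: `autotune` and the `candidates` mapping are hypothetical names, and a real GPU version would time with CUDA events and synchronize around the timed loop. The defaults mirror the `-I` and `-W` defaults quoted above.

```python
import time


def autotune(candidates, iterations=50, warmups=20):
    """Return (name, average_seconds) of the fastest candidate.

    candidates maps a name to a zero-argument callable (a hypothetical
    stand-in for launching one GEMM algorithm).
    """
    best_name, best_avg = None, float('inf')
    for name, run in candidates.items():
        # Warm up so one-time setup cost is not included in the timing.
        for _ in range(warmups):
            run()
        start = time.perf_counter()
        for _ in range(iterations):
            run()
        avg = (time.perf_counter() - start) / iterations
        if avg < best_avg:
            best_name, best_avg = name, avg
    return best_name, best_avg


if __name__ == '__main__':
    algos = {
        'fast': lambda: None,
        'slow': lambda: time.sleep(0.001),
    }
    name, avg = autotune(algos, iterations=10, warmups=2)
    print(name, avg)
```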

**Note:** This PR also changes the default `gemm_compute_type` for BF16
and FP16 to `CUBLAS_COMPUTE_32F`.

**Further observations:** 
1. The support matrix of the `cublaslt_gemm` could be furt...

60b13256

22 Nov, 2023 1 commit
- Benchmarks: Micro benchmark - Add hipBLASLt function benchmark (#576) · 79089b65
  Yuting Jiang authored Nov 22, 2023
```
**Description**
Add hipBLASLt function benchmark and rebase cuBLASLt function benchmark.
```
  79089b65
20 Nov, 2023 1 commit
- Benchmarks: micro benchmarks - add int8 support for cublaslt function (#574) · f53d941a
  Yuting Jiang authored Nov 20, 2023
```
**Description**
Add INT8 support for the cuBLASLt function benchmark.
```
  f53d941a
22 Mar, 2023 1 commit
- Benchmark - Support batch/shape range in cublaslt gemm (#494) · dbeba805
  Yifan Xiong authored Mar 22, 2023
```
Support batch and shape range with multiplication factors in cublaslt
gemm benchmark.
```
  dbeba805
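
A range with a multiplication factor can be expanded as sketched below. The `start:stop:factor` syntax, the inclusive upper bound, and the helper names `expand_range` and `expand_shape` are assumptions made for illustration; the commit message does not spell out the exact option format.

```python
from itertools import product


def expand_range(spec):
    """Expand 'start:stop:factor' into [start, start*factor, ...] <= stop.

    A plain integer such as '4096' expands to a single value.
    The syntax is an assumption for illustration.
    """
    parts = [int(p) for p in spec.split(':')]
    if len(parts) == 1:
        return parts
    if len(parts) != 3:
        raise ValueError(f'expected "value" or "start:stop:factor", got {spec!r}')
    start, stop, factor = parts
    if start <= 0 or factor <= 1:
        raise ValueError('start must be positive and factor greater than 1')
    values = []
    value = start
    while value <= stop:
        values.append(value)
        value *= factor
    return values


def expand_shape(shape):
    """Expand an 'm,n,k' string whose fields may each be a range."""
    dims = [expand_range(field) for field in shape.split(',')]
    return list(product(*dims))


if __name__ == '__main__':
    print(expand_range('16:128:2'))
    print(expand_shape('1024:2048:2,512,64'))
```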
20 Mar, 2023 1 commit
- Benchmarks - Support tensor core precisions in cublaslt gemm (#492) · b808135c
  Yifan Xiong authored Mar 20, 2023
```
Support FP64/TF32/FP16/BF16 in cublaslt (batch) GEMM.
```
  b808135c
03 Jan, 2023 1 commit
- Benchmarks - Integrate cublaslt micro-benchmark (#455) · 616e7a5a
  Yifan Xiong authored Jan 03, 2023
```
Integrate cublaslt-gemm micro-benchmark #451.
```
  616e7a5a
01 Apr, 2022 1 commit

Benchmarks: Add Feature - Provide option to save raw data into file. (#333) · 6d895da8

guoshzhao authored Apr 01, 2022

**Description**
Use the `log_raw_data` config to control whether the raw data is logged to a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw data to a file.

6d895da8

08 Feb, 2022 1 commit
- Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301) · 682b2c12
  Ziyue Yang authored Feb 08, 2022
```
This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
```
  682b2c12
07 Feb, 2022 1 commit

Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298) · 85389055

Ziyue Yang authored Feb 07, 2022

**Description**
This commit does the following to reduce result variance in the gpu_copy benchmark:
1) Add a warmup phase to the gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional; enabling it is not recommended in performance tests;
4) Enlarge the message size in the performance benchmark.

85389055

21 Jan, 2022 1 commit

Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285) · 74421ffe

Ziyue Yang authored Jan 21, 2022

**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.

74421ffe

09 Dec, 2021 1 commit
- Benchmarks: Unify metric names of benchmarks (#252) · 9f56b219
  Yuting Jiang authored Dec 09, 2021
```
**Description**
Unify metric names of benchmarks.
```
  9f56b219
30 Oct, 2021 1 commit

Benchmarks: Add Feature - Add CPU-initiated copy and dtod support to gpu-sm-copy benchmark (#230) · 008e0fe1

Ziyue Yang authored Oct 30, 2021

**Description**
This commit does the following:
1) Adds a CPU-initiated copy benchmark;
2) Adds a dtod benchmark;
3) Supports scanning NUMA nodes and GPUs inside the benchmark program;
4) Changes the name of gpu-sm-copy to gpu-copy.

008e0fe1

30 Aug, 2021 1 commit
- Benchmarks: Add Benchmark - Add GPU SM copy benchmark (#169) · b97197f0
  Ziyue Yang authored Aug 30, 2021
```
**Description**
This commit adds gpu_sm_copy benchmark and related tests.
```
  b97197f0
27 Aug, 2021 1 commit

Benchmarks: Code Revision - Rename kernel_launch_overhead metrics (#171) · 35114bae

guoshzhao authored Aug 28, 2021

**Description**
Rename `kernel_launch_overhead_event` to `event_overhead`, `kernel_launch_overhead_wall` to `wall_overhead`.

35114bae

19 May, 2021 1 commit
- Benchmarks: Add Benchmark - Add kernel launch overhead benchmark. (#74) · e977bbc1
  guoshzhao authored May 19, 2021
```
* add kernel launch overhead benchmark.
```
  e977bbc1