- 19 Mar, 2026 1 commit
-
-
one authored
-
- 20 Jun, 2025 2 commits
-
-
Babak Hejazi authored
**Description** Enable autotuning as an opt-in mode when benchmarking cublasLt via `cublaslt_gemm`. The implementation is based on https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu

The behavior of the original benchmark command remains unchanged, e.g.:
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3`

The new opt-in options are `-a` (autotune), `-I` (autotune iterations, default 50, same as the default for `-i`), and `-W` (autotune warmups, default 20, same as the default for `-w`), e.g.:
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a`
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10`

**Note:** This PR also changes the default `gemm_compute_type` for BF16 and FP16 to `CUBLAS_COMPUTE_32F`.

**Further observations:**
1. The support matrix of `cublaslt_gemm` could be extended in the future to support non-FP16 output for FP8 inputs as well.
2. Currently, the input matrices are initialized with values of 1.0 and 2.0, which makes them less demanding in terms of power. Another future extension could add a fill mode for, say, uniform random numbers between -1 and 1.
3. cuBLAS workspace recommendations are listed under https://docs.nvidia.com/cuda/cublas/#cublassetworkspace

Update (June 10, 2025): verified using the higher-level test driver with these commands:
1. inline:
```
python3 -c "
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger
parameters = (
    '--num_warmup 10 --num_steps 50 '
    '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 '
    '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50'
)
context = BenchmarkRegistry.create_benchmark_context(
    'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters
)
benchmark = BenchmarkRegistry.launch_benchmark(context)
logger.info('Result: {}'.format(benchmark.result))
"
```
2. newly added script: `python3 examples/benchmarks/cublaslt_function.py`

---------

Co-authored-by: Babak Hejazi <babakh@nvidia.com>
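The autotune flow can be sketched generically: warm each candidate up, time it over a fixed number of iterations, and keep the fastest. A minimal Python sketch of that idea (the `autotune` helper is hypothetical; the real benchmark times cublasLt algorithm candidates in C++, with warmups and iterations controlled by `-W` and `-I`):

```python
import time

def autotune(candidates, warmups=20, iters=50):
    """Pick the fastest candidate by timing each one.

    Hypothetical helper illustrating the autotune flow, not the actual
    cublaslt_gemm code: run `warmups` untimed calls, then average
    `iters` timed calls, and keep the fastest candidate.
    """
    best_idx, best_time = -1, float('inf')
    for idx, run in enumerate(candidates):
        for _ in range(warmups):
            run()
        start = time.perf_counter()
        for _ in range(iters):
            run()
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_idx, best_time = idx, elapsed
    return best_idx, best_time
```

The defaults above mirror the flag defaults mentioned in the description (`-W 20`, `-I 50`).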
-
WenqingLan1 authored
**Description** Added support for the Grace CPU neo2 architecture in CPU Stream. CPU Stream now supports dual-socket benchmarking. Example config for this arch support:
```yaml
cpu-stream:numa0:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 0
        cores: 0 1 2 3 4 5 6 7 8
cpu-stream:numa1:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 1
        cores: 64 65 66 67 68 69 70 71 72
cpu-stream:numa-spread:
  timeout: *default_timeout
  modes:
    - name: local
      parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 0 1
        cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
```

---------

Co-authored-by: dpower4 <dilipreddi@gmail.com>
-
- 18 Jun, 2025 1 commit
-
-
WenqingLan1 authored
Added GPU Stream benchmark
- measures GPU memory bandwidth and efficiency for the double datatype through various memory operations, including copy, scale, add, and triad.
- added documentation for `gpu-stream` detailing its introduction, metrics, and descriptions.
- added unit tests for `gpu-stream`. Example output is in `superbenchmark/tests/data/gpu_stream.log`.
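The four STREAM kernels named above can be sketched on host memory. This is a pure-Python illustration of what each kernel computes and how bandwidth is derived (the actual gpu-stream runs these on device buffers of doubles; function name and byte accounting here are illustrative):

```python
import time

def stream_triad_bandwidth(n=1_000_000, q=3.0):
    """Run the four STREAM kernels on plain Python lists and return
    the triad kernel's bandwidth in bytes/second (host-side sketch,
    not the gpu-stream implementation)."""
    a = [1.0] * n
    b = [2.0] * n
    c = [0.0] * n
    c = a[:]                                # copy:  c = a
    b = [q * x for x in c]                  # scale: b = q * c
    c = [x + y for x, y in zip(a, b)]       # add:   c = a + b
    start = time.perf_counter()
    a = [y + q * x for x, y in zip(c, b)]   # triad: a = b + q * c
    elapsed = time.perf_counter() - start
    # triad touches three arrays of 8-byte doubles: two reads, one write
    return 3 * 8 * n / elapsed
```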
-
- 05 Feb, 2025 1 commit
-
-
Hongtao Zhang authored
**Description**
1. Fixed a bug where the nvbandwidth benchmark failed to handle 'N/A' values in the nvbandwidth command output.
2. Replaced the input format of test cases with a list.
3. Added an nvbandwidth configuration example to the default config files.

---------

Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
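The 'N/A' handling in point 1 amounts to tolerating non-numeric cells when parsing the bandwidth matrix that nvbandwidth prints. A hypothetical sketch of the idea (`parse_bandwidth_row` is illustrative, not the actual parser):

```python
def parse_bandwidth_row(tokens):
    """Parse one row of an nvbandwidth-style bandwidth matrix.

    Maps 'N/A' entries (e.g. a GPU paired with itself, or an
    unsupported path) to None instead of crashing on float().
    Hypothetical helper illustrating the fix, not the actual code.
    """
    values = []
    for tok in tokens:
        if tok == 'N/A':
            values.append(None)
        else:
            values.append(float(tok))
    return values
```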
-
- 28 Nov, 2024 1 commit
-
-
pdr authored
Added llama benchmark
- training and inference, in accordance with the existing pytorch model implementations such as gpt2, lstm, etc.
- added a llama fp8 unit test for better code coverage and to reduce the memory required
- updated transformers version >= 4.28.0 for `LlamaConfig`
- set tokenizers version <= 0.20.3 to avoid the 0.20.4 version [issues](https://github.com/huggingface/tokenizers/issues/1691) with py3.8
- added llama2 to tensorrt
- llama2 tests not added to test_tensorrt_inference_performance.py due to the large memory requirement for the worker GPU; tests were validated separately on GH200

---------

Co-authored-by: dpatlolla <dpatlolla@microsoft.com>
-
- 22 Nov, 2024 1 commit
-
-
Hongtao Zhang authored
**Description** Add nvbandwidth benchmark. --------- Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
-
- 08 Dec, 2023 1 commit
-
-
Ziyue Yang authored
Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588) **Description** Add one-to-all, all-to-one, and all-to-all support to gpu_copy_bw_performance, and fix a performance bug in gpu_copy.
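The three new patterns can be described by the (src, dst) GPU pairs they benchmark. A hypothetical sketch of that enumeration (`copy_pairs` is illustrative, not the benchmark's code):

```python
def copy_pairs(mode, gpus, root=0):
    """Enumerate (src, dst) GPU pairs for the three copy patterns
    added to gpu_copy_bw_performance (hypothetical helper)."""
    if mode == 'one-to-all':
        # one root GPU copies to every other GPU
        return [(root, d) for d in gpus if d != root]
    if mode == 'all-to-one':
        # every other GPU copies to one root GPU
        return [(s, root) for s in gpus if s != root]
    if mode == 'all-to-all':
        # every GPU copies to every other GPU
        return [(s, d) for s in gpus for d in gpus if s != d]
    raise ValueError(f'unknown mode: {mode}')
```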
-
- 24 Mar, 2023 1 commit
-
-
Ziyue Yang authored
**Description** This PR adds a micro-benchmark of distributed model inference workloads. **Major Revision**
- Add a new micro-benchmark dist-inference.
- Add corresponding example and unit tests.
- Update configuration files to include this new micro-benchmark.
- Update the micro-benchmark README.

--------- Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
-
- 21 Mar, 2023 1 commit
-
-
rafsalas19 authored
**Description**
- Added HPL benchmark

---------

Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
-
- 13 Feb, 2023 1 commit
-
-
rafsalas19 authored
**Description**
- Added stream benchmark
- Added stream unit test
- Added stream example
- Modified docker files to build stream

---------

Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
Co-authored-by: Yifan Xiong <xiongyf@yandex.com>
-
- 11 Apr, 2022 1 commit
-
-
guoshzhao authored
**Description** Integrate FAMBench into superbench based on docker implementation: https://github.com/facebookresearch/FAMBench The script to run all benchmarks is: https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh
-
- 16 Mar, 2022 1 commit
-
-
rafsalas19 authored
**Description** Modifications adding GPU-Burn to SuperBench.
- added third-party submodule
- modified Makefile to build the gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn
-
- 08 Feb, 2022 1 commit
-
-
Ziyue Yang authored
This commit makes data checking in gpu_copy optional, because it can take too long when the message size is large.
-
- 21 Jan, 2022 1 commit
-
-
Ziyue Yang authored
**Description** This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
-
- 13 Dec, 2021 1 commit
-
-
Hossein Pourreza authored
**Description** Add mlc memory bandwidth and latency micro benchmark to Superbench. **Major Revision** - Add mlc benchmark with test and example files
-
- 10 Dec, 2021 1 commit
-
-
guoshzhao authored
**Description** Add ONNXRuntime inference benchmark based on ORT python API. **Major Revision** - Add `ORTInferenceBenchmark` class to export pytorch model to onnx model and do inference - Add tests and example for `ort-inference` benchmark - Update the introduction docs.
-
- 25 Nov, 2021 1 commit
-
-
Kaiyu Xie authored
**Description** Fix typo in description of kernel_launch_overhead.py
-
- 12 Nov, 2021 1 commit
-
-
Yifan Xiong authored
__Description__ Add TensorRT inference benchmark for torchvision models. __Major Revision__ - Measure TensorRT inference performance.
-
- 09 Nov, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add ib traffic validation distributed benchmark. **Major Revision** - Add ib traffic validation distributed benchmark, example and test
-
- 30 Oct, 2021 1 commit
-
-
Ziyue Yang authored
**Description** This commit does the following: 1) Adds a CPU-initiated copy benchmark; 2) Adds a dtod benchmark; 3) Supports scanning NUMA nodes and GPUs inside the benchmark program; 4) Changes the name of gpu-sm-copy to gpu-copy.
-
- 27 Oct, 2021 1 commit
-
-
guoshzhao authored
Add RocmOnnxModelBenchmark class to run benchmarks packaged in superbench/benchmark:rocm4.3.1-onnxruntime1.9.0
-
- 22 Oct, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add gpcnet microbenchmark **Major Revision** - add 2 microbenchmarks for gpcnet: gpc-network-test, gpc-network-load-test - add related test and example files
-
- 12 Oct, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add tcp connectivity validation microbenchmark, which validates TCP connectivity between the current node and the nodes listed in the hostfile. **Major Revision** - Add tcp connectivity validation microbenchmark and related test and example
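At its core, the per-node check is a bounded TCP connect attempt. A minimal sketch, assuming a hypothetical `check_tcp` helper (the actual benchmark reads its target addresses from the hostfile and reports per-node results):

```python
import socket

def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds (illustrative sketch of the connectivity check,
    not the benchmark's actual implementation)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```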
-
- 30 Aug, 2021 2 commits
-
-
Ziyue Yang authored
**Description** This commit adds gpu_sm_copy benchmark and related tests.
-
Yuting Jiang authored
**Description** Add gemm flops microbenchmark for amd. **Major Revision** - Add gemm flops microbenchmark for amd. - Add related example and test file.
-
- 27 Aug, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add memory bus bandwidth performance microbenchmark for amd. **Major Revision** - Add memory bus bandwidth performance microbenchmark for amd. - Add related example and test file.
-
- 30 Jul, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add rccl bandwidth microbenchmark for rocm. **Major Revision** - Register rccl-bw benchmark.
-
- 26 Jul, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add NCCL performance microbenchmark. **Major Revision** - Add microbenchmark, example, test, config for NCCL
-
- 23 Jul, 2021 2 commits
-
-
Yuting Jiang authored
**Description** Add RDMA Loopback performance microbenchmark. **Major Revision** - Add microbenchmark, example, test, config for RDMA Loopback
-
Ziyue Yang authored
**Description** Add disk performance microbenchmark. **Major Revision** - Add microbenchmark, example, test, config for disk performance. **Minor Revision** - Fix bugs in executor unit test related to default enabled tests.
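At its simplest, a disk microbenchmark times blocked sequential I/O and divides bytes moved by elapsed time. A hedged sketch of that idea (the helper below is hypothetical; the actual microbenchmark covers more access patterns and parameters):

```python
import os
import tempfile
import time

def seq_write_bandwidth(size_mb=64, block_kb=512):
    """Measure sequential write bandwidth (MB/s) to a temp file.

    Minimal sketch of what a disk microbenchmark times; fsync is
    included so the OS page cache does not hide the actual write.
    """
    block = b'\0' * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        with os.fdopen(fd, 'wb') as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
        return size_mb / elapsed
    finally:
        os.remove(path)
```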
-
- 13 Jul, 2021 1 commit
-
-
Yuting Jiang authored
Add microbenchmark, example, test, and config for cuda memory performance. Add cuda-samples (tagged with cuda version) as a git submodule and update the related makefile.
-
- 02 Jun, 2021 1 commit
-
-
guoshzhao authored
* add cuda flops performance benchmark.
-
- 01 Jun, 2021 1 commit
-
-
Yuting Jiang authored
* add python related cudnn microbenchmark
-
- 31 May, 2021 1 commit
-
-
Yuting Jiang authored
* add benchmark for cublas test
* format
* revise error handling and test
* add interface to read json file, revise json file path and include .json in packaging
* add random_seed in arguments
* revise preprocess of cublas benchmark
* fix lint error and note error in source code
* update according to comments
* revise input arguments from json file to custom str and convert json file to built-in dict list
* restore package config
* fix lint issue
* update platform and comments
* rename files to match source code dir and fix comment errors

Co-authored-by: root <root@sb-validation-000001.51z1chmys5fuzfqyo4niepozre.bx.internal.cloudapp.net>
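The json-to-dict-list conversion mentioned above can be sketched as follows (the function name and accepted format are assumptions, not the benchmark's actual interface):

```python
import json

def load_cublas_configs(json_str):
    """Convert a JSON string of cuBLAS test cases into a built-in
    list of dicts (hypothetical sketch: the commit above moved from
    reading a packaged JSON file to accepting a custom string)."""
    configs = json.loads(json_str)
    if isinstance(configs, dict):
        # allow a single test case without a wrapping list
        configs = [configs]
    return configs
```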
-
- 19 May, 2021 2 commits
-
-
guoshzhao authored
* add kernel launch overhead benchmark.
-
Yuting Jiang authored
-
- 26 Apr, 2021 1 commit
-
-
guoshzhao authored
* revise example settings of cnn models.
-
- 20 Apr, 2021 2 commits