Commits · a10c3e150b0ebec207485936f48714a2fc2b1948 · tsoc / superbenchmark

01 Apr, 2026 3 commits
- Refactor environment variable handling in runner.py · a10c3e15
  one authored Apr 01, 2026
  
  a10c3e15
- Update dtk dockerfile to use venv · 325db60e
  one authored Apr 01, 2026
  
  325db60e
- Add gpu-hpcg metrics · 2056d7fa
  one authored Apr 01, 2026
  
  2056d7fa
31 Mar, 2026 1 commit
- Update dtk docker image · 4f69c7de
  one authored Mar 31, 2026
  
  4f69c7de
27 Mar, 2026 1 commit
- MicroBenchmark: rocHPCG · e4c2bd4c
  one authored Mar 27, 2026
  
  e4c2bd4c
25 Mar, 2026 1 commit
- Improve DTK gemm-flops · 211e63c7
  one authored Mar 25, 2026
  
  211e63c7
20 Mar, 2026 1 commit
- Fix paths, deps, envs in dockerfile · df0bde6c
  one authored Mar 20, 2026
  
  df0bde6c
19 Mar, 2026 3 commits

Migrate gpu-stream to BabelStream v5.0 · d4051602
one authored Mar 19, 2026

d4051602

Enhance DTK platform support and GPU detection · 1a57f2d6

one authored Mar 19, 2026

- Added Platform.DTK in the microbenchmark framework.
- Introduced new DTK hipblaslt benchmark class and corresponding tests.
- Updated Dockerfile to include hipblaslt-bench and its permissions.
- Registered DTK benchmarks in the benchmark registry for various performance tests.
- Enhanced GPU detection logic to recognize HYGON GPUs.

This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.

1a57f2d6

Update DTK dockerfile and microbenchmarks · c4f39919

one authored Mar 19, 2026

- Update rocm_common.cmake for CMake>=3.24
- Prevent isolation build
- Add BabelStream as a submodule
- Update dockerignore

c4f39919

17 Mar, 2026 1 commit
- Add a dtk dockerfile · 0fdfe4c3
  one authored Mar 17, 2026
  
  0fdfe4c3
11 Mar, 2026 1 commit

Microbenchmark: upgrade Intel MLC to v3.12 in rocm5.0.x (#784) · 6b8e8104

Hongtao Zhang authored Mar 10, 2026



## Summary
- Upgrade Intel Memory Latency Checker from v3.11 to v3.12 in
rocm5.0.x.dockerfile
- Aligns with other dockerfiles that already use v3.12
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

6b8e8104

04 Feb, 2026 1 commit

Submodule Update: update gpu-burn to newest version (#761) · 575859be

WenqingLan1 authored Feb 03, 2026



Updated 3rd party submodule gpu-burn to newest version for
implementation & doc support for cuda13.0.
Co-authored-by: guoshzhao <guzhao@microsoft.com>

575859be

28 Jan, 2026 1 commit

CI/CD - Fix Image build for cuda11.1.1 (#771) · 8b805d90

Hongtao Zhang authored Jan 28, 2026



**Description**

- When building the CUDA 11.1.1 image, pip (Python 3.8) cannot find a
pre-built wheel for the latest wandb release (v0.23.1), so it attempts
to build wandb from source. That build fails because the image does not
have Go installed, which is required to build wandb from source.

**Solution**

- For the CUDA 11.1.1 build, install the required build tools (e.g., Go,
Rust, and Cargo) needed for wandb.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

8b805d90

21 Dec, 2025 1 commit

CI/CD - Fix Azure pipeline (#767) · c99380b4

Hongtao Zhang authored Dec 20, 2025



**Description**
The Azure pipeline job cpu-unit-test failed with "2025-12-10T03:47:59.0628597Z
ERROR: Could not install packages due to an OSError: [Errno 28] No space
left on device".

**Root Cause**
This happens because the matrix jobs (Python 3.7, 3.10, 3.12) run in
parallel and share the same VM's disk. Python 3.12 downloads
newer/larger packages (especially PyTorch and NVIDIA CUDA libraries
which are ~3GB+), and when multiple jobs run simultaneously, they
exhaust the disk space.

**Fix**
Disable the cache usage when installing SB
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

c99380b4

04 Dec, 2025 1 commit

Bug fix - update IB_DEVICES specification logic to fix ib-loopback test regression (#762) · e3fd943a

Henry Li authored Dec 03, 2025

**Description**

The ib-loopback test regressed after this recent
[change](https://github.com/microsoft/superbenchmark/commit/c65ae56713d6bfcc4a3be37d7fe24779590f9791).
When ib-loopback runs with the standard
[config](https://github.com/microsoft/superbenchmark/blob/c65ae56713d6bfcc4a3be37d7fe24779590f9791/superbench/config/default.yaml#L69),
the test fails because numeric values such as `0` are passed to the
test command, and these are not valid IB device names.

Example failure:

```
[2025-11-25 22:08:38,100 vmssnc6ec000003:141056][micro_base.py:200][INFO] Execute command - round: 0, benchmark: ib-loopback, command: /usr/local/bin/run_perftest_loopback 47 45 /usr/local/bin/ib_write_bw -s 8388608 -F --iters=20000 -d 0 -p 45617 -x 0 --report_gbits.
[0]: IB device 0 not found
 Unable to find the Infiniband/RoCE device
IB device 0 not found
 Unable to find the Infiniband/RoCE device
[2025-11-25 22:08:39,113 vmssnc6ec000003:141056][micro_base.py:209][ERROR] Microbenchmark execution failed - round: 0, benchmark: ib-loopback, error message: IB device 0 not found
 Unable to find the Infiniband/RoCE device
IB device 0 not found
 Unable to find the Infiniband/RoCE device
```
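
The failure above comes from a numeric value reaching `ib_write_bw -d`, which expects a device name. A minimal sketch of a guard against that, assuming a hypothetical helper `resolve_ib_device` and an already-discovered list of device names (neither is taken from the actual fix, and the `mlx5_*` names are only examples):

```python
from typing import List


def resolve_ib_device(value: str, devices: List[str]) -> str:
    """Return a device name suitable for `ib_write_bw -d`.

    Hypothetical helper: `value` is either a device name or a numeric
    index into `devices`, the names discovered on the host.
    """
    if value in devices:
        return value
    if value.isdigit():
        index = int(value)
        if index < len(devices):
            # Translate the numeric index into the matching device name.
            return devices[index]
    raise ValueError(f'IB device {value!r} not found among {devices}')


devices = ['mlx5_0', 'mlx5_1']  # example names, assumed for illustration
print(resolve_ib_device('0', devices))
print(resolve_ib_device('mlx5_1', devices))
```

Failing early with a clear error keeps an invalid value out of the benchmark command line.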


---------
Co-authored-by: Henry Li <lihl@microsoft.com>

e3fd943a

17 Nov, 2025 1 commit

Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB... · c65ae567

Yuting Jiang authored Nov 17, 2025

Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)

**Description**
Add --set_ib_devices option to auto-select the IB device by MPI local rank.


**Major Revision**
- Add a new CLI flag --set_ib_devices to automatically select irregular
IB devices based on the MPI local rank.
- When enabled, the benchmark queries available IB devices via
network.get_ib_devices() and selects the device corresponding to
OMPI_COMM_WORLD_LOCAL_RANK.
- Fall back to existing --ib_dev behavior when the flag is not provided.

**Minor Revision**
- Add an environment variable in network.get_ib_devices() that lets the
user set the device name
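
The selection described above can be sketched as follows. This is an illustration only: `select_ib_device` is a hypothetical name, the device list stands in for whatever network.get_ib_devices() returns, and defaulting to rank 0 when `OMPI_COMM_WORLD_LOCAL_RANK` is unset is an assumption.

```python
import os
from typing import List, Mapping, Optional


def select_ib_device(devices: List[str],
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the IB device whose position matches the MPI local rank."""
    env = os.environ if env is None else env
    # Open MPI exports the per-node rank of each process in this variable.
    local_rank = int(env.get('OMPI_COMM_WORLD_LOCAL_RANK', '0'))
    if not 0 <= local_rank < len(devices):
        raise IndexError(
            f'local rank {local_rank} has no matching IB device in {devices}')
    return devices[local_rank]


# Example device names, assumed for illustration.
devices = ['mlx5_0', 'mlx5_1', 'mlx5_2', 'mlx5_3']
print(select_ib_device(devices, {'OMPI_COMM_WORLD_LOCAL_RANK': '2'}))
```

Passing the environment as a mapping makes the helper easy to exercise without launching under MPI.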

c65ae567

06 Nov, 2025 1 commit
- Fix pipelines - Update mlc version in dockerfiles from v3.11 to v3.12 (#752) · 25db1115
  WenqingLan1 authored Nov 06, 2025
```
Updated mlc wget link in dockerfiles.

---------
Co-authored-by: guoshzhao <guzhao@microsoft.com>
```
  25db1115
05 Nov, 2025 1 commit

CI/CD - Fix Azure test pipeline (#754) · 1b4377fc

Hongtao Zhang authored Nov 04, 2025

The Python 3.10 verification pipeline failed because of a conflicting
'setuptools' version, as shown below.
<img width="1157" height="622" alt="image" src="https://github.com/user-attachments/assets/ba0f6045-4b92-4fd8-b92f-1c474725534c" />

Root Cause:
The problem is that modern pip (25.3) uses an isolated build environment
with the latest setuptools by default. The pipeline installs setuptools
65.7 in the user environment, but pip builds the package in an isolated
environment with newer setuptools, which conflicts with the version
check in [setup.py].

Solution:
Remove pip upgrade.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

1b4377fc

23 Oct, 2025 1 commit

Benchmarks: Micro benchmark - add ncu profile support in cublaslt-gemm (#740) · f6e65a98

Yuting Jiang authored Oct 23, 2025

**Description**
This PR adds NCU (NVIDIA Nsight Compute) profiling support to the
cublaslt-gemm micro benchmark, enabling detailed kernel analysis
including DRAM throughput, compute throughput, and launch arguments.

**Major Revision**
- Add --enable_ncu_profiling and --profiling_metrics for ncu profiling
- Modifies command execution to use NCU when profiling is enabled
- Updates result parsing to handle both standard and NCU profiled output
formats
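
As a rough illustration of the command change, the sketch below prefixes a benchmark command with `ncu` when profiling is enabled. `build_command` and its parameters are hypothetical names, not the benchmark's actual code; the metric names are examples of Nsight Compute metrics, and `--metrics` is the ncu flag that takes a comma-separated list.

```python
from typing import List, Optional


def build_command(base_cmd: str,
                  enable_ncu: bool = False,
                  metrics: Optional[List[str]] = None) -> str:
    """Return the command to run, wrapped with ncu when profiling is on."""
    if not enable_ncu:
        return base_cmd
    parts = ['ncu']
    if metrics:
        # ncu expects the metric names as one comma-separated argument.
        parts += ['--metrics', ','.join(metrics)]
    parts.append(base_cmd)
    return ' '.join(parts)


cmd = build_command(
    'cublaslt_gemm -m 2048 -n 12288 -k 1536 -t fp16',
    enable_ncu=True,
    metrics=[
        'dram__throughput.avg.pct_of_peak_sustained_elapsed',
        'sm__throughput.avg.pct_of_peak_sustained_elapsed',
    ],
)
print(cmd)
```

Because the profiled run prints ncu's report around the benchmark's own output, the result parser has to accept both formats, as the last bullet notes.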

f6e65a98

22 Oct, 2025 2 commits

Benchmarks: Micro benchmark - Support verification and parallel run for disk... · fe234262

Ziyue Yang authored Oct 22, 2025


Benchmarks: Micro benchmark - Support verification and parallel run for disk performance benchmark (#741)

**Description**
Adds verification and parallel run support for disk performance
benchmark.

**Major Revision**
- Adds a `--verify` flag to verify written data.
- Supports loading benchmark options from the `PROC_RANK`, `BLOCK_DEVICES`
and `NUMA_NODES` environment variables.
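
One way such environment-driven options could be read is sketched below. The value format is an assumption (comma-separated lists indexed by `PROC_RANK`), and `load_disk_options` is a hypothetical helper; the commit does not describe the actual parsing.

```python
import os
from typing import Dict, Mapping, Optional


def load_disk_options(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Pick this process's block device and NUMA node from the environment.

    Assumption: BLOCK_DEVICES and NUMA_NODES are comma-separated lists and
    PROC_RANK is the index of the entry this process should use.
    """
    env = os.environ if env is None else env
    rank = int(env.get('PROC_RANK', '0'))
    devices = [d for d in env.get('BLOCK_DEVICES', '').split(',') if d]
    numa_nodes = [int(n) for n in env.get('NUMA_NODES', '').split(',') if n]
    return {
        'block_device': devices[rank] if rank < len(devices) else None,
        'numa_node': numa_nodes[rank] if rank < len(numa_nodes) else None,
    }


print(load_disk_options({
    'PROC_RANK': '1',
    'BLOCK_DEVICES': '/dev/nvme0n1,/dev/nvme1n1',
    'NUMA_NODES': '0,1',
}))
```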

---------
Co-authored-by: guoshzhao <guzhao@microsoft.com>

fe234262

CI/CD - Fix python3.10 pipeline (#753) · 86a940c1

Hongtao Zhang authored Oct 21, 2025



**Description**
The Python 3.10 pipeline failed.

**Solution**
The log shows that the 'bc' command is missing. Since our image tags are
simple, the fix is to drop the 'bc' command altogether.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

86a940c1

08 Oct, 2025 2 commits

Enhancement: Add nsys and pytorch profiler debug trace support (#744) · d804dbb6

Hongtao Zhang authored Oct 08, 2025



To improve benchmark debugging, the following debug methods were added:

PyTorch profiler in model benchmarks

- SB_ENABLE_PYTORCH_PROFILER: switch to enable/disable
- SB_TORCH_PROFILER_TRACE_DIR: log path

These two runtime variables must be configured in the SB config file.

nsys in the SB runner

- SB_ENABLE_NSYS: switch to enable/disable
- SB_NSYS_TRACE_DIR: log path

These two runtime variables must be configured in the runner's environment.
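
A minimal sketch of how a runner might honor the nsys switch. The accepted truthy values, the `/tmp` fallback, the `sb_trace` output name, and the helper `wrap_with_nsys` are all assumptions for illustration; only the two variable names come from the commit. `nsys profile --output=...` is the standard Nsight Systems invocation.

```python
import os
import shlex
from typing import Mapping, Optional


def wrap_with_nsys(command: str,
                   env: Optional[Mapping[str, str]] = None) -> str:
    """Prefix `command` with `nsys profile` when SB_ENABLE_NSYS is set."""
    env = os.environ if env is None else env
    # Assumed convention: '1' or 'true' (any case) turns tracing on.
    if env.get('SB_ENABLE_NSYS', '0').lower() not in ('1', 'true'):
        return command
    trace_dir = env.get('SB_NSYS_TRACE_DIR', '/tmp').rstrip('/')
    output = f'{trace_dir}/sb_trace'  # hypothetical report name
    return f'nsys profile --output={shlex.quote(output)} {command}'


print(wrap_with_nsys('sb exec -c config.yaml',
                     {'SB_ENABLE_NSYS': '1', 'SB_NSYS_TRACE_DIR': '/traces'}))
```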

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

d804dbb6

CI/CD - Fix image merge in GitHub Action. (#749) · b9864244
Yifan Xiong authored Oct 07, 2025
```
Fix image merge for release event in GitHub Action.
```
b9864244

01 Oct, 2025 1 commit

Dockerfile - add cuda13.0.dockerfile (#739) · 60189dd6

WenqingLan1 authored Oct 01, 2025



Add support for cuda13.0.
Add cuda13.0.dockerfile.
Add cuda13.0 image building task to github pipeline.
Update GPU STREAM to work with cuda13.0.
Fix data type conversion perf bug in GPU stream.
Update nvbandwidth submodule to be v0.8.
Update perftest submodule to be 4bee61f80d9e268fc97eaf40be00409e91d3a19e
(recent master).

---------
Co-authored-by: Ubuntu <dilipreddi@gmail.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>

60189dd6

30 Sep, 2025 1 commit

Benchmarks: Micro benchmark - Add simultaneous all-to-host / host-to-all... · 93e9d262

Yuting Jiang authored Sep 30, 2025

Benchmarks: Micro benchmark - Add simultaneous all-to-host / host-to-all bandwidth testcases to nvbandwidth (#736)

**Description**
Add simultaneous all-to-host / host-to-all bandwidth testcases to
nvbandwidth.

**Major Revision**
- nvbandwidth.patch: Add simultaneous all-to-host / host-to-all
bandwidth testcases to nvbandwidth
- upgrade nvbandwidth submodule to v0.8
- add patch into makefile build

93e9d262

29 Sep, 2025 2 commits
- Benchmark: Model benchmark - add option to exclude data copy time in model benchmarks (#734) · 76066b6d
  Yuting Jiang authored Sep 29, 2025
```
**Description**
add option to exclude data copy time in model benchmarks.

**Major Revision**
- add an option --no_copy
- move start time after data copy finish
```
  76066b6d
- Benchmarks: Micro benchmark - Add numa support for nvbandwidth (#742) · ad8e0143
  Yuting Jiang authored Sep 29, 2025
```
**Description**
Add numa support for nvbandwidth.
```
  ad8e0143
19 Sep, 2025 1 commit

Benchmarks: micro benchmarks - change cublasLtMatmulDescCreate scaleType from... · a7c4ed92

Yuting Jiang authored Sep 20, 2025

Benchmarks: micro benchmarks - change cublasLtMatmulDescCreate scaleType from CUDA_R_32F to CUDA_R_16F in FP16 dist inference (#732)

**Description**
change cublasLtMatmulDescCreate scaleType from CUDA_R_32F to CUDA_R_16F
in FP16 dist inference to fix cublaslt error.

a7c4ed92

12 Aug, 2025 1 commit

Release - SuperBench v0.12.0 (#729) · 0b4311cd

Hongtao Zhang authored Aug 12, 2025



**Description**

Cherry-pick bug fixes from v0.12.0 to main.

**Major Revisions**

* #725
* #727
* #728
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Yifan Xiong <yixio@microsoft.com>
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

0b4311cd

30 Jun, 2025 1 commit

Benchmarks: Add Mixture of Experts Model (#679) · 44e35cda

pdr authored Jun 30, 2025



Added MoE model using MixtralConfig. 

1. Added 8x7b and 8x22b variants 
2. Requires high VRAM as all experts are loaded in memory. Thus,
disabled training due to memory constraint on test worker.

---------
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

44e35cda

26 Jun, 2025 1 commit

Benchmarks - Add deepseek megatron-lm benchmark (#713) · deef9a3d

Yuting Jiang authored Jun 27, 2025



**Description**
Add deepseek megatron-lm benchmark.

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

deef9a3d

25 Jun, 2025 1 commit

Dockerfile - Add cuda12.9 docker image (#716) · a56356d8

guoshzhao authored Jun 25, 2025



**Description**
Add cuda 12.9 dockerfile and build in pipeline.

---------
Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>

a56356d8

24 Jun, 2025 1 commit

Benchmarks - Add FP4 GEMM FLOPS support for cublaslt_gemm benchmark (#711) · b795477e

guoshzhao authored Jun 24, 2025



**Description**
Add FP4 precision support for cublaslt_gemm benchmark.

**Major Revision**
- Add new type `fp4e2m1` and `__nv_fp4_e2m1`.
- For FP4 matmul, the precision of MatrixC (add) must be FP16 and the
precision of MatrixD (output) must be FP4; otherwise it will not work.
- Add macro `CUDA_VERSION` to resolve the compatibility issue of
different CUDA versions.

---------
Co-authored-by: Ubuntu <aiperf@aiperf000000.hp5z1gqeinfufbj2u3jcty5fme.cdmx.internal.cloudapp.net>
Co-authored-by: AVA <39534996+avazr@users.noreply.github.com>
Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com>

b795477e

20 Jun, 2025 2 commits

Benchmark - Support autotuning in cublaslt gemm (#706) · 60b13256

Babak Hejazi authored Jun 20, 2025

**Description**
Enable autotuning as an opt-in mode when benchmarking cublasLt via
`cublaslt_gemm`

The implementation is based on
https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu

The behavior of original benchmark command remains unchanged, e.g.:
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3`

The new opt-in options are `-a` (autotune), `-I` (autotune iterations,
default 50, same as the default for `-i`), and `-W` (autotune warmups,
default 20, same as the default for `-w`), e.g.:
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3
-a`
- `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a
-I 10 -W 10`
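
The idea behind the opt-in mode can be sketched in a few lines: run each candidate algorithm with some warmups, time a fixed number of iterations, and keep the fastest. This is a generic illustration, not the cuBLASLt code from the PR; `autotune`, `candidates`, and the injected `measure` callable are hypothetical names, and the defaults mirror the `-W`/`-I` values quoted above.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar('T')


def autotune(candidates: Iterable[T],
             measure: Callable[[T], float],
             warmups: int = 20,
             iterations: int = 50) -> Tuple[T, float]:
    """Return the candidate with the lowest mean time per run.

    `measure(candidate)` runs the candidate once and returns its duration.
    """
    best, best_time = None, float('inf')
    for candidate in candidates:
        for _ in range(warmups):
            measure(candidate)  # warm-up runs are discarded
        mean = sum(measure(candidate) for _ in range(iterations)) / iterations
        if mean < best_time:
            best, best_time = candidate, mean
    if best is None:
        raise ValueError('no candidates to tune')
    return best, best_time


# Fake per-run costs stand in for real kernel timings.
costs = {'algo_a': 3.0, 'algo_b': 1.5, 'algo_c': 2.0}
print(autotune(costs, costs.__getitem__, warmups=1, iterations=2))
```

Injecting the timing function keeps the selection logic testable without a GPU.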

**Note:** This PR also changes the default `gemm_compute_type` for BF16
and FP16 to `CUBLAS_COMPUTE_32F`.

**Further observations:** 
1. The support matrix of the `cublaslt_gemm` could be further extended
in the future to support non-FP16 output as well for FP8 inputs.
2. Currently, the input matrices are initialized with values of 1.0 and
2.0 which makes them less demanding in terms of power. Another future
extension could be to enable another fill mode for, say, uniform random
numbers between -1 and 1.
3. cuBLAS workspace recommendations are listed under
https://docs.nvidia.com/cuda/cublas/#cublassetworkspace



Update (June 10, 2025): verified using higher level test driver with
these commands:

1. inline:
```
python3 -c "
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger

parameters = (
    '--num_warmup 10 --num_steps 50 '
    '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 '
    '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50'
)
context = BenchmarkRegistry.create_benchmark_context(
    'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters
)
benchmark = BenchmarkRegistry.launch_benchmark(context)
logger.info('Result: {}'.format(benchmark.result))
"
```

2. newly added script: 
`python3 examples/benchmarks/cublaslt_function.py`

---------
Co-authored-by: Babak Hejazi <babakh@nvidia.com>

60b13256

Benchmark - Add Grace CPU support for CPU Stream (#719) · 0b8d1fd4

WenqingLan1 authored Jun 19, 2025



**Description**
Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU
Stream supports dual socket benchmarking.

Example config for this arch support:
```yaml
    cpu-stream:numa0:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 0
        cores: 0 1 2 3 4 5 6 7 8
    cpu-stream:numa1:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 1
        cores: 64 65 66 67 68 69 70 71 72
    cpu-stream:numa-spread:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 0 1
        cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
```

---------
Co-authored-by: dpower4 <dilipreddi@gmail.com>

0b8d1fd4

18 Jun, 2025 1 commit

Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a

WenqingLan1 authored Jun 18, 2025

Added GPU Stream benchmark - measures the GPU memory bandwidth and
efficiency for double datatype through various memory operations
including copy, scale, add, and triad.
- added documentation for `gpu-stream` detailing its introduction,
metrics, and descriptions.
- added unit tests for `gpu-stream`. Example output is in
`superbenchmark/tests/data/gpu_stream.log`.
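
For readers unfamiliar with STREAM-style kernels, the sketch below shows how bandwidth is conventionally derived from the four operations. It follows the standard STREAM accounting (copy and scale touch two arrays, add and triad touch three); the function name and defaults are illustrative and are not taken from the benchmark's source.

```python
# Arrays read or written per element by each STREAM kernel:
#   copy:  c[i] = a[i]            scale: b[i] = s * c[i]
#   add:   c[i] = a[i] + b[i]     triad: a[i] = b[i] + s * c[i]
ARRAYS_TOUCHED = {'copy': 2, 'scale': 2, 'add': 3, 'triad': 3}


def stream_bandwidth_gbps(kernel: str,
                          num_elements: int,
                          seconds: float,
                          element_size: int = 8) -> float:
    """Bandwidth in GB/s for one kernel run (element_size=8 for double)."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * num_elements * element_size
    return bytes_moved / seconds / 1e9


# Illustrative numbers only: 2**27 doubles processed in 10 ms.
for name in ARRAYS_TOUCHED:
    print(name, round(stream_bandwidth_gbps(name, 2**27, 0.010), 1))
```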

4eddd50a

14 Jun, 2025 1 commit

microbenchmark - CPU Stream Benchmark Revise (#712) · 991c0051

Hongtao Zhang authored Jun 14, 2025



In the current implementation, the CPU‑stream benchmark code renames the
binary before the microbench base class can verify its existence,
causing the default‐binary check to fail.

This PR adds a “default” binary—built with the standard compile
parameters—so that the base class can always find and validate it. Once
the default binary is in place, the CPU‑stream code will rename it as
needed and re‑check its presence before running the benchmark.
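
The check-then-rename order described above might look like the following sketch. `prepare_binary` and the file names `stream` and `stream_neo2` are hypothetical, and the sketch copies the default binary instead of renaming it so that the default stays available for later checks; that is a choice made for this example, not necessarily what the PR does.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_binary(bin_dir: Path, default_name: str, target_name: str) -> Path:
    """Validate the default binary, derive the target, then re-check it."""
    default_bin = bin_dir / default_name
    if not default_bin.is_file():
        raise FileNotFoundError(f'default binary missing: {default_bin}')
    target = bin_dir / target_name
    if target != default_bin:
        # Copy rather than rename so the default binary remains in place.
        shutil.copy2(default_bin, target)
    if not target.is_file():
        raise FileNotFoundError(f'target binary missing: {target}')
    return target


with tempfile.TemporaryDirectory() as tmp:
    bin_dir = Path(tmp)
    (bin_dir / 'stream').write_text('placeholder binary')
    print(prepare_binary(bin_dir, 'stream', 'stream_neo2').name)
```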

The PR also enables CPU stream in the default settings.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

991c0051

05 Jun, 2025 1 commit
- Update CODEOWNERS (#718) · 431bf19c
  Yifan Xiong authored Jun 05, 2025
```
Update CODEOWNERS.
```
  431bf19c
01 May, 2025 1 commit
- cuda arch flag for cublaslt (#701) · 3e090482
  pdr authored Apr 30, 2025
```
adding gb200 cuda arch flag for cublaslt compilation
```
  3e090482