Commits · 0993db7574fc4d4e8f95be582511389a8e977086 · tsoc / superbenchmark

21 Apr, 2026 1 commit

Runner: Add local numactl GPU affinity support (#6) · 0993db75

one authored Apr 21, 2026

- Add `numactl` support for local runner modes, including `cpunodebind`, `membind`, and `physcpubind`.
- Add `gpu_affinity` resolution through `sb node topo --get gpu-numa-affinity --gpu-id`.
- Add `sb node topo` support for GPU NUMA topology queries.
- Update BW1000 config to use the new local `numactl` semantics.
- Document the new `numactl` mode fields and limitations.
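
A minimal sketch of how the three binding fields could be turned into a `numactl` command prefix. The helper name `build_numactl_prefix` and its signature are hypothetical and not taken from the runner; only the `numactl` options `--cpunodebind`, `--membind`, and `--physcpubind` are real.

```python
from typing import List, Optional


def build_numactl_prefix(
    cpunodebind: Optional[str] = None,
    membind: Optional[str] = None,
    physcpubind: Optional[str] = None,
) -> List[str]:
    """Return a numactl argv prefix for the given bindings (empty if none are set)."""
    options = [
        ("--cpunodebind", cpunodebind),
        ("--membind", membind),
        ("--physcpubind", physcpubind),
    ]
    # Keep only the bindings the caller actually configured.
    args = [f"{flag}={value}" for flag, value in options if value is not None]
    return ["numactl", *args] if args else []


# Hypothetical usage: bind CPU and memory to NUMA node 0, then run a command.
prefix = build_numactl_prefix(cpunodebind="0", membind="0")
command = prefix + ["echo", "benchmark"]
print(" ".join(command))
```

Returning an empty prefix when nothing is configured lets the caller prepend the result unconditionally.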

0993db75

20 Apr, 2026 1 commit
- Update mem-bw to use BandwidthTest (#5) · 800b962a
  one authored Apr 20, 2026
```
* Update mem-bw to use BandwidthTest

* Update config and format code
```
  800b962a
18 Apr, 2026 4 commits

Fix some lint warnings (#3) · b31acf90

one authored Apr 18, 2026

* Fix some lint warnings
* Exclude some paths in cpplint
* Fix some tests and formatting

b31acf90

Format python code on branch dtk · 2bf01d5e
one authored Apr 18, 2026

2bf01d5e

Benchmark: Model benchmark - deterministic training support (#731) (#2) · 47d4a79d

one authored Apr 18, 2026



Adds an opt-in deterministic training mode to SuperBench's PyTorch model
benchmarks. When enabled with --enable-determinism, PyTorch deterministic
algorithms are enforced, and per-step numerical fingerprints (loss,
activation means) are recorded as metrics. These can be compared across
runs using the existing sb result diagnosis pipeline to verify bit-exact
reproducibility, which is useful for hardware validation and platform
comparison.
 
Flags added - 

--enable-determinism
--check-frequency: Number of steps after which the metrics are recorded
--deterministic-seed

Changes - 

Updated pytorch_base.py to handle deterministic settings, logging.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything
works as expected.

Usage - 

Step 1: Run 1 - Run with --enable-determinism; the necessary metrics
are recorded in the results-summary.jsonl file.
Step 2: Generate the baseline file from the Run 1 results using
sb result generate-baseline.
Step 3: Run 2 - Run with --enable-determinism on a different machine (or
the same machine); the necessary metrics are recorded in the
results-summary.jsonl file.
Step 4: Run diagnosis on the results generated from the 2 runs using the
sb result diagnosis command.

Note - 
1. Make sure all the parameters are constant between the 2 runs 
2. Running the diagnosis command requires the rules.yaml file
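
The idea behind the bit-exact check can be sketched as a plain equality comparison of per-step fingerprints from two runs. This is only an illustration and not how `sb result diagnosis` is implemented; the function `find_mismatches` and the metric names are hypothetical.

```python
from typing import Dict, List


def find_mismatches(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Return the metric names that are missing or not exactly equal in the candidate run."""
    mismatched = []
    for name, expected in baseline.items():
        actual = candidate.get(name)
        # Bit-exact check: plain equality, deliberately without a tolerance.
        if actual is None or actual != expected:
            mismatched.append(name)
    return sorted(mismatched)


# Hypothetical metric names and values, for illustration only.
run1 = {"loss_step_0": 2.302585, "loss_step_10": 1.734012}
run2 = {"loss_step_0": 2.302585, "loss_step_10": 1.734013}
print(find_mismatches(run1, run2))
```

Any non-empty result means the two runs diverged at the listed steps, which is why all parameters must be kept constant between runs.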

---------
Co-authored-by: Aishwarya Tonpe <aishwarya.tonpe25@gmail.com>
Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>

47d4a79d

Runner: validate MPI bind-to option and cover configurable bind-to in tests · 655519cb
one authored Apr 18, 2026

655519cb

17 Apr, 2026 1 commit
- Add --container-name for custom docker container name · e1d791d2
  one authored Apr 17, 2026
  
  e1d791d2
15 Apr, 2026 1 commit
- Update GPU vendors · f57d86f4
  one authored Apr 15, 2026
  
  f57d86f4
02 Apr, 2026 3 commits
- Re-implement kernel launch · 04564997
  one authored Apr 02, 2026
  
  04564997
- Fix runner test · 05cdf5d6
  one authored Apr 02, 2026
  
  05cdf5d6
- Use env file in docker instead of /tmp · c1bc12ce
  one authored Apr 02, 2026
  
  c1bc12ce
01 Apr, 2026 3 commits
- Fix rocHPCG metric extraction · 742f203d
  one authored Apr 01, 2026
  
  742f203d
- Refactor environment variable handling in runner.py · a10c3e15
  one authored Apr 01, 2026
  
  a10c3e15
- Add gpu-hpcg metrics · 2056d7fa
  one authored Apr 01, 2026
  
  2056d7fa
19 Mar, 2026 2 commits

Migrate gpu-stream to BabelStream v5.0 · d4051602
one authored Mar 19, 2026

d4051602

Enhance DTK platform support and GPU detection · 1a57f2d6

one authored Mar 19, 2026

- Added Platform.DTK in the microbenchmark framework.
- Introduced new DTK hipblaslt benchmark class and corresponding tests.
- Updated Dockerfile to include hipblaslt-bench and its permissions.
- Registered DTK benchmarks in the benchmark registry for various performance tests.
- Enhanced GPU detection logic to recognize HYGON GPUs.

This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.

1a57f2d6

17 Nov, 2025 1 commit

Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB... · c65ae567

Yuting Jiang authored Nov 17, 2025

Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)

**Description**
Add the --set_ib_devices option to auto-select the IB device by MPI local rank.


**Major Revision**
- Add a new CLI flag --set_ib_devices to automatically select irregular
IB devices based on the MPI local rank.
- When enabled, the benchmark queries available IB devices via
network.get_ib_devices() and selects the device corresponding to
OMPI_COMM_WORLD_LOCAL_RANK.
- Fall back to existing --ib_dev behavior when the flag is not provided.

**Minor Revision**
- Add an environment variable in network.get_ib_devices() to allow the
user to set the device name
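
The selection rule described above can be sketched as indexing the device list with the MPI local rank. The helper `select_ib_device` and the device names are hypothetical; in the commit the list comes from `network.get_ib_devices()`, and the rank from `OMPI_COMM_WORLD_LOCAL_RANK`. Defaulting to rank 0 when the variable is unset is an assumption of this sketch.

```python
import os
from typing import List, Mapping


def select_ib_device(devices: List[str], env: Mapping[str, str] = os.environ) -> str:
    """Pick the IB device whose index equals the MPI local rank."""
    # Assumption: fall back to rank 0 when not launched under Open MPI.
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    if not 0 <= local_rank < len(devices):
        raise IndexError(f"local rank {local_rank} has no matching IB device")
    return devices[local_rank]


# Hypothetical device list, for illustration only.
devices = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
print(select_ib_device(devices, {"OMPI_COMM_WORLD_LOCAL_RANK": "2"}))
```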

c65ae567

23 Oct, 2025 1 commit

Benchmarks: Micro benchmark - add ncu profile support in cublaslt-gemm (#740) · f6e65a98

Yuting Jiang authored Oct 23, 2025

**Description**
This PR adds NCU (NVIDIA Nsight Compute) profiling support to the
cublaslt-gemm micro benchmark, enabling detailed kernel analysis
including DRAM throughput, compute throughput, and launch arguments.

**Major Revision**
- Add --enable_ncu_profiling and --profiling_metrics for NCU profiling
- Modify command execution to use NCU when profiling is enabled
- Update result parsing to handle both standard and NCU-profiled output
formats
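
A rough sketch of wrapping a benchmark command with the Nsight Compute CLI when profiling is enabled. The helper `wrap_with_ncu`, the binary name, and its arguments are hypothetical; `ncu --metrics` with a comma-separated list is the real CLI form, and the two metric names are given only as examples of DRAM and compute throughput metrics.

```python
from typing import List, Sequence


def wrap_with_ncu(command: Sequence[str], enable: bool, metrics: Sequence[str] = ()) -> List[str]:
    """Prefix the command with ncu (and an optional metric list) when profiling is enabled."""
    if not enable:
        return list(command)
    ncu = ["ncu"]
    if metrics:
        # ncu expects the metrics as one comma-separated argument.
        ncu += ["--metrics", ",".join(metrics)]
    return ncu + list(command)


# Hypothetical benchmark command line and metric selection.
base = ["cublaslt_gemm", "-m", "4096", "-n", "4096", "-k", "4096"]
metrics = [
    "dram__throughput.avg.pct_of_peak_sustained_elapsed",
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
]
print(" ".join(wrap_with_ncu(base, enable=True, metrics=metrics)))
```

Keeping the unprofiled path as a plain copy of the command means the result parser only needs to switch formats when the prefix is present.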

f6e65a98

22 Oct, 2025 1 commit

Benchmarks: Micro benchmark - Support verification and parallel run for disk... · fe234262

Ziyue Yang authored Oct 22, 2025


Benchmarks: Micro benchmark - Support verification and parallel run for disk performance benchmark (#741)

**Description**
Adds verification and parallel run support for disk performance
benchmark.

**Major Revision**
- Adds `--verify` flag to support verifying written data.
- Supports loading benchmark options from `PROC_RANK`, `BLOCK_DEVICES`
and `NUMA_NODES` environment variables.
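
One possible way to derive per-process options from these variables is sketched below. The commit message does not state the value formats, so this assumes `PROC_RANK` is an integer and that `BLOCK_DEVICES` and `NUMA_NODES` are comma-separated lists indexed by rank; the helper `options_from_env` is hypothetical.

```python
from typing import Mapping, Optional, Tuple


def options_from_env(env: Mapping[str, str]) -> Tuple[str, Optional[int]]:
    """Return the (block device, NUMA node) pair assigned to this process rank."""
    rank = int(env["PROC_RANK"])
    # Assumption: both lists are comma-separated and indexed by rank.
    devices = env["BLOCK_DEVICES"].split(",")
    nodes = [int(n) for n in env.get("NUMA_NODES", "").split(",") if n]
    device = devices[rank % len(devices)]
    node = nodes[rank % len(nodes)] if nodes else None
    return device, node


# Hypothetical environment for the second of two parallel processes.
env = {"PROC_RANK": "1", "BLOCK_DEVICES": "/dev/nvme0n1,/dev/nvme1n1", "NUMA_NODES": "0,1"}
print(options_from_env(env))
```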

---------
Co-authored-by: guoshzhao <guzhao@microsoft.com>

fe234262

29 Sep, 2025 2 commits
- Benchmark: Model benchmark - add option to exclude data copy time in model benchmarks (#734) · 76066b6d
  Yuting Jiang authored Sep 29, 2025
```
**Description**
Add an option to exclude data copy time in model benchmarks.

**Major Revision**
- Add an option --no_copy
- Move the start time to after the data copy finishes
```
  76066b6d
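
The effect of `--no_copy` on step timing can be sketched as restarting the clock once the data copy has finished. This is a simplified CPU-only illustration with hypothetical names (`timed_step`, `copy_data`, `compute`); a real GPU benchmark would also need device synchronization before reading the clock.

```python
import time
from typing import Callable


def timed_step(copy_data: Callable[[], object], compute: Callable[[object], None], no_copy: bool) -> float:
    """Return the step time in seconds, excluding the data copy when no_copy is True."""
    start = time.perf_counter()
    batch = copy_data()
    if no_copy:
        # Restart the clock once the copy has finished.
        start = time.perf_counter()
    compute(batch)
    return time.perf_counter() - start


def fake_copy() -> list:
    time.sleep(0.1)  # stand-in for a host-to-device copy
    return [0] * 8


def fake_compute(batch: object) -> None:
    time.sleep(0.01)  # stand-in for the forward/backward pass


with_copy = timed_step(fake_copy, fake_compute, no_copy=False)
without_copy = timed_step(fake_copy, fake_compute, no_copy=True)
print(f"with copy: {with_copy:.3f}s, without copy: {without_copy:.3f}s")
```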
- Benchmarks: Micro benchmark - Add numa support for nvbandwidth (#742) · ad8e0143
  Yuting Jiang authored Sep 29, 2025
```
**Description**
Add numa support for nvbandwidth.
```
  ad8e0143
30 Jun, 2025 1 commit

Benchmarks: Add Mixture of Experts Model (#679) · 44e35cda

pdr authored Jun 30, 2025



Added MoE model using MixtralConfig. 

1. Added 8x7b and 8x22b variants 
2. Requires high VRAM because all experts are loaded in memory, so
training is disabled due to memory constraints on the test worker.

---------
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

44e35cda

26 Jun, 2025 1 commit

Benchmarks - Add deepseek megatron-lm benchmark (#713) · deef9a3d

Yuting Jiang authored Jun 27, 2025



**Description**
Add deepseek megatron-lm benchmark.

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

deef9a3d

25 Jun, 2025 1 commit

Dockerfile - Add cuda12.9 docker image (#716) · a56356d8

guoshzhao authored Jun 25, 2025



**Description**
Add cuda 12.9 dockerfile and build in pipeline.

---------
Co-authored-by: Guoshuai Zhao <microsoft@microsoft.com>
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Hongtao Zhang <garyworkzht@gmail.com>

a56356d8

20 Jun, 2025 1 commit

Benchmark - Add Grace CPU support for CPU Stream (#719) · 0b8d1fd4

WenqingLan1 authored Jun 19, 2025



**Description**
Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU
Stream supports dual socket benchmarking.

Example config for this arch support:
```yaml
    cpu-stream:numa0:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 0
        cores: 0 1 2 3 4 5 6 7 8
    cpu-stream:numa1:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 1
        cores: 64 65 66 67 68 69 70 71 72
    cpu-stream:numa-spread:
      timeout: *default_timeout
      modes:
      - name: local
        parallel: no
      parameters:
        cpu_arch: neo2
        numa_mem_nodes: 0 1
        cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
```

---------
Co-authored-by: dpower4 <dilipreddi@gmail.com>

0b8d1fd4

18 Jun, 2025 1 commit

Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a

WenqingLan1 authored Jun 18, 2025

Added GPU Stream benchmark - measures the GPU memory bandwidth and
efficiency for double datatype through various memory operations
including copy, scale, add, and triad.
- added documentation for `gpu-stream` detailing its introduction,
metrics, and descriptions.
- added unit tests for `gpu-stream`. Example output is in
`superbenchmark/tests/data/gpu_stream.log`.
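
As a rough illustration of how bandwidth is conventionally derived for the four STREAM-style kernels, the sketch below uses the standard STREAM accounting (copy and scale touch two arrays, add and triad touch three). Whether `gpu-stream` reports its numbers exactly this way is an assumption; the function name is hypothetical.

```python
# Arrays read or written per element, following the usual STREAM convention.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def bandwidth_gb_per_s(kernel: str, n_elements: int, seconds: float, bytes_per_element: int = 8) -> float:
    """Return the effective bandwidth in GB/s; 8 bytes per element matches the double datatype."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * n_elements * bytes_per_element
    return bytes_moved / seconds / 1e9


# Hypothetical measurement: one million doubles copied in one millisecond.
print(bandwidth_gb_per_s("copy", 1_000_000, 0.001))
```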

4eddd50a

14 Jun, 2025 1 commit

microbenchmark - CPU Stream Benchmark Revise (#712) · 991c0051

Hongtao Zhang authored Jun 14, 2025



In the current implementation, the CPU-stream benchmark code renames the
binary before the microbench base class can verify its existence,
causing the default-binary check to fail.

This PR adds a “default” binary, built with the standard compile
parameters, so that the base class can always find and validate it. Once
the default binary is in place, the CPU-stream code will rename it as
needed and re-check its presence before running the benchmark.

The PR also enables CPU stream in the default settings.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

991c0051

04 Mar, 2025 1 commit

Analyzer - Enhance logging information for diagnosis rule op baseline errors. (#689) · 64edc9c5

Jorge Esguerra authored Mar 04, 2025

Improves logging info for diagnosis rule op baseline errors. This allows
developers to easily detect errors in their rule files as well as
baseline files, improving end-user experience.

64edc9c5

15 Feb, 2025 1 commit

Bugfix: Avoid Unintended nvbandwidth Function Calls in All Benchmarks (#685) · 41a484fa

Hongtao Zhang authored Feb 14, 2025



Root Cause:

1. '_get_all_test_cases()' was called in '_parser' while '_parser' was
defined in the base class.
2. In '_get_all_test_cases()', the cmd path was not included.

Fix:

1. Remove '_get_all_test_cases()' from '_parser'.
2. Construct path for cmd.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

41a484fa

05 Feb, 2025 1 commit

Bugfix - nvbandwidth benchmark need to handle N/A value (#675) · 45d06647

Hongtao Zhang authored Feb 05, 2025



**Description**

1. Fixed the bug where the nvbandwidth benchmark did not handle 'N/A'
values in nvbandwidth cmd output.
2. Replaced the input format of test cases with a list.
3. Added an nvbandwidth configuration example to the default config files.
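
The 'N/A' handling in item 1 can be sketched as a row parser that maps such cells to `None` instead of failing on `float("N/A")`. The row layout (a leading row index followed by whitespace-separated values) and the sample numbers are assumptions for illustration, and `parse_row` is a hypothetical helper rather than the benchmark's parser.

```python
from typing import List, Optional


def parse_row(line: str) -> List[Optional[float]]:
    """Parse one matrix row, mapping 'N/A' cells to None instead of raising."""
    cells = line.split()[1:]  # drop the leading row index
    return [None if cell == "N/A" else float(cell) for cell in cells]


# Hypothetical row: the diagonal entry has no measurement.
print(parse_row("0 N/A 276.07 276.36"))
```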

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>

45d06647

28 Nov, 2024 1 commit

Benchmarks - Add LLaMA-2 Models (#668) · 249e21c1

pdr authored Nov 27, 2024

Added llama benchmark - training and inference in accordance with the
existing pytorch models implementation like gpt2, lstm etc.

- added llama fp8 unit test for better code coverage, to reduce memory
required
- updated transformers version >= 4.28.0 for LLamaConfig
- set tokenizers version <= 0.20.3 to avoid 0.20.4 version
[issues](https://github.com/huggingface/tokenizers/issues/1691) with
py3.8
- added llama2 to tensorrt
- llama2 tests not added to test_tensorrt_inference_performance.py due
to large memory requirement for worker gpu. tests validated separately
on gh200

---------
Co-authored-by: dpatlolla <dpatlolla@microsoft.com>

249e21c1

27 Nov, 2024 1 commit

CI/CD - Upgrade dependency versions in pipeline (#671) · 96f5ccea

Yifan Xiong authored Nov 26, 2024



Upgrade dependency versions in Azure pipeline:

* Remove Python 3.6 and add Python 3.10 for cpu-unit-test
* Upgrade CUDA from 11.1 to 12.4 for cuda-unit-test
* Update labels accordingly

---------
Co-authored-by: Dilip Patlolla <dilipreddi@gmail.com>

96f5ccea

22 Nov, 2024 1 commit

Benchmarks: micro benchmarks - add nvbandwidth benchmark (#669) · 7cef624e

Hongtao Zhang authored Nov 21, 2024



**Description**

Add nvbandwidth benchmark.

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

7cef624e

20 Nov, 2024 1 commit

Benchmarks: micro benchmarks - add general CPU bandwidth and latency benchmark (#662) · 9c35e80a

Hongtao Zhang authored Nov 20, 2024



**Description**
Add micro benchmark to measure general CPU bandwidth and latency without 'mlc'.

Test output:
```
{
"cpu-memory-bw-latency/return_code": 0,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw": 5388.75021,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_lat": 0.185571786,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_bw": 4634.82028,
"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_lat": 0.215758096,
}
```
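
The flat metric names in the output above can be regrouped into bandwidth and latency matrices keyed by the two NUMA indices. This is a minimal sketch with a hypothetical helper; what each index position means (for example CPU node versus memory node) and the units are not stated in the commit and are left uninterpreted here.

```python
import re
from typing import Dict, Tuple

PATTERN = re.compile(r"mem_bandwidth_matrix_numa_(\d+)_(\d+)_(bw|lat)$")


def to_matrices(results: Dict[str, float]) -> Dict[str, Dict[Tuple[int, int], float]]:
    """Group flat metric names into {'bw': {(i, j): v}, 'lat': {(i, j): v}}."""
    matrices: Dict[str, Dict[Tuple[int, int], float]] = {"bw": {}, "lat": {}}
    for name, value in results.items():
        match = PATTERN.search(name)
        if match:  # skip metrics such as return_code
            first, second, kind = int(match.group(1)), int(match.group(2)), match.group(3)
            matrices[kind][(first, second)] = value
    return matrices


# Values copied from the test output in the commit message.
results = {
    "cpu-memory-bw-latency/return_code": 0,
    "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw": 5388.75021,
    "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_lat": 0.185571786,
    "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_bw": 4634.82028,
    "cpu-memory-bw-latency/mem_bandwidth_matrix_numa_1_0_lat": 0.215758096,
}
matrices = to_matrices(results)
print(matrices["bw"])
```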

---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>

9c35e80a

20 Aug, 2024 1 commit
- Bug: Executor - Fix executor for Benchmark Execution Without Explicit Framework Field (#636) · 96cc4d93
  Yang Wang authored Aug 21, 2024
```
**Description**
Fix executor for Benchmark Execution Without Explicit Framework Field
```
  96cc4d93
23 Jul, 2024 1 commit

Update omegaconf version to 2.3.0 (#631) · 9a3ce39d

Yang Wang authored Jul 24, 2024

Update `omegaconf` version to
[2.3.0](https://pypi.org/project/omegaconf/2.3.0/) as omegaconf 2.0.6
has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.1 will
enforce this behaviour change.
Discussion can be found at https://github.com/pypa/pip/issues/12063.
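
To illustrate why `>=5.1.*` is non-standard: under PEP 440 a trailing `.*` wildcard is only permitted with the `==` and `!=` operators. The check below is a simplified sketch for single specifiers, not pip's actual validation logic, and the helper name is hypothetical.

```python
import re

# Longer operators come first so that "===" and ">=" are not split.
SPECIFIER = re.compile(r"^\s*(===|==|!=|~=|<=|>=|<|>)\s*(\S+)\s*$")


def has_invalid_wildcard(specifier: str) -> bool:
    """Return True if a '.*' wildcard is combined with an operator other than == or !=."""
    match = SPECIFIER.match(specifier)
    if match is None:
        raise ValueError(f"not a single version specifier: {specifier!r}")
    operator, version = match.groups()
    return version.endswith(".*") and operator not in ("==", "!=")


for spec in (">=5.1.*", "==5.1.*", ">=5.1"):
    print(spec, has_invalid_wildcard(spec))
```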

9a3ce39d

02 Apr, 2024 1 commit
- Benchmarks: Revise Code - Add hipblasLt tuning to dist-inference cpp implementation (#616) · cc89ee59
  Ziyue Yang authored Apr 02, 2024
```
**Description**
Adds hipblasLt tuning to dist-inference cpp implementation.
```
  cc89ee59
08 Jan, 2024 1 commit

Release - SuperBench v0.10.0 (#607) · 2c88db90

Yifan Xiong authored Jan 07, 2024

**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - U...

2c88db90

10 Dec, 2023 1 commit
- Benchmarks: Microbenchmark - Add distributed inference benchmark cpp implementation (#586) · 719a427f
  Ziyue Yang authored Dec 11, 2023
```
**Description**
Add distributed inference benchmark cpp implementation.
```
  719a427f
08 Dec, 2023 1 commit

Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support... · 4fa60be7

Ziyue Yang authored Dec 08, 2023

Benchmarks: Micro benchmark - Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588)

**Description**
Add one-to-all, all-to-one, all-to-all support to
gpu_copy_bw_performance, and fix a performance bug in gpu_copy.
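
The three copy patterns can be pictured as sets of (source, destination) GPU pairs. The sketch below is only an illustration: the helper `copy_pairs`, the choice of GPU 0 as the default root, and the exclusion of self-copies are assumptions, not details taken from the benchmark.

```python
from typing import List, Tuple


def copy_pairs(pattern: str, num_gpus: int, root: int = 0) -> List[Tuple[int, int]]:
    """Return the (src, dst) GPU pairs exercised by a copy pattern."""
    gpus = range(num_gpus)
    if pattern == "one-to-all":
        return [(root, dst) for dst in gpus if dst != root]
    if pattern == "all-to-one":
        return [(src, root) for src in gpus if src != root]
    if pattern == "all-to-all":
        # Assumption: a GPU does not copy to itself in this pattern.
        return [(src, dst) for src in gpus for dst in gpus if src != dst]
    raise ValueError(f"unknown pattern: {pattern}")


print(copy_pairs("one-to-all", 4))
```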

4fa60be7