Commits · 47d4a79d5868a7173fc580a55e16c8486a6ce32f · tsoc / superbenchmark

18 Apr, 2026 3 commits

Benchmark: Model benchmark - deterministic training support (#731) (#2) · 47d4a79d

one authored Apr 18, 2026



Adds an opt-in deterministic training mode to SuperBench's PyTorch model
benchmarks. When enabled with --enable-determinism, PyTorch deterministic
algorithms are enforced, and per-step numerical fingerprints (loss,
activation means) are recorded as metrics. These can be compared across
runs using the existing sb result diagnosis pipeline to verify bit-exact
reproducibility, which is useful for hardware validation and platform
comparison.
 
Flags added - 

--enable-determinism
--check-frequency: Number of steps after which you want the metrics to
be recorded
--deterministic-seed

Changes - 

Updated pytorch_base.py to handle deterministic settings, logging.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything
works as expected.

Usage - 

Step 1: Run 1 - Run with --enable-determinism; the necessary metrics
will be recorded in the results-summary.jsonl file
Step 2: Generate the baseline file from the Run 1 results using the sb
result generate-baseline command
Step 3: Run 2 - Run with --enable-determinism on a different machine (or
the same machine); the necessary metrics will be recorded in the
results-summary.jsonl file
Step 4: Run diagnosis on the results generated from the 2 runs using the
sb result diagnosis command

Note - 
1. Make sure all the parameters are constant between the 2 runs 
2. Running the diagnosis command requires the rules.yaml file
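
A minimal sketch of the bit-exact comparison idea behind the steps above. It is not the sb result diagnosis implementation: the metric names and the flat one-JSON-object-per-line layout are assumptions made for illustration, and the real keys in results-summary.jsonl may differ.

```python
import json
import struct


def float_bits(value: float) -> bytes:
    """Return the IEEE-754 double encoding, so equality means bit-exact."""
    return struct.pack('>d', value)


def load_fingerprints(jsonl_text: str) -> dict:
    """Merge every JSON object (one per line) into one metric -> value dict."""
    metrics = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            metrics.update(json.loads(line))
    return metrics


def diff_runs(baseline: dict, candidate: dict) -> list:
    """List metric names that are missing in one run or not bit-identical."""
    mismatches = []
    for name in sorted(set(baseline) | set(candidate)):
        a, b = baseline.get(name), candidate.get(name)
        if a is None or b is None or float_bits(a) != float_bits(b):
            mismatches.append(name)
    return mismatches


# Hypothetical metric names and values, standing in for two recorded runs.
run1 = load_fingerprints(
    '{"deterministic_loss_step_0": 2.3025851, "deterministic_act_mean_step_0": 0.125}'
)
run2 = load_fingerprints(
    '{"deterministic_loss_step_0": 2.3025851, "deterministic_act_mean_step_0": 0.12500001}'
)
print(diff_runs(run1, run2))
```

Comparing the packed bytes rather than using a tolerance is the point of the sketch: a deterministic run is expected to reproduce every recorded value exactly, so any difference is reported.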

---------
Co-authored-by: Aishwarya Tonpe <aishwarya.tonpe25@gmail.com>
Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>

47d4a79d

Format python code · 8c28b69a
one authored Apr 18, 2026

8c28b69a
Runner: validate MPI bind-to option and cover configurable bind-to in tests · 655519cb
one authored Apr 18, 2026

655519cb

17 Apr, 2026 4 commits
- Improve launch bounds for gpu-copy · eea26d0d
  one authored Apr 17, 2026
  
  eea26d0d
- Update ansible playbooks to suppress warnings · 2ea51c1d
  one authored Apr 17, 2026
```
- Get ansible_user_dir from facts
- Get hostname from facts
- Update NODE_RANK expression
```
  2ea51c1d
- Merge pull request #1 from alephpiece/one/deploy-docs · ad7ae5c4
  one authored Apr 17, 2026
```
Configure GitHub Pages
```
  ad7ae5c4
- Add --container-name for custom docker container name · e1d791d2
  one authored Apr 17, 2026
  
  e1d791d2
15 Apr, 2026 1 commit
- Update GPU vendors · f57d86f4
  one authored Apr 15, 2026
  
  f57d86f4
02 Apr, 2026 9 commits
- Add bw1000 config files (beta) · 49a4389b
  one authored Apr 02, 2026
  
  49a4389b
- Update docker volumes in deploy.yaml · 53e0e494
  one authored Apr 02, 2026
  
  53e0e494
- Update dtk platform detection · 42bc5b87
  one authored Apr 02, 2026
  
  42bc5b87
- Add dtk dockerfile for docker 18 · 4599cd69
  one authored Apr 02, 2026
  
  4599cd69
- Update docs · b8b080e2
  one authored Apr 02, 2026
  
  b8b080e2
- Re-implement kernel launch · 04564997
  one authored Apr 02, 2026
  
  04564997
- Fix runner test · 05cdf5d6
  one authored Apr 02, 2026
  
  05cdf5d6
- Use env file in docker instead of /tmp · c1bc12ce
  one authored Apr 02, 2026
  
  c1bc12ce
- Add topo mapping for dtk26.04 · c128dabb
  one authored Apr 02, 2026
  
  c128dabb
01 Apr, 2026 7 commits
- Update rocHPCG metrics · e514815d
  one authored Apr 01, 2026
  
  e514815d
- Add metric sorters for RCCL tests and rocHPCG · 05e137be
  one authored Apr 01, 2026
  
  05e137be
- Fix rocHPCG metric extraction · 742f203d
  one authored Apr 01, 2026
  
  742f203d
- Convert rochpcg script patch into shell script · b623c7e9
  one authored Apr 01, 2026
  
  b623c7e9
- Refactor environment variable handling in runner.py · a10c3e15
  one authored Apr 01, 2026
  
  a10c3e15
- Update dtk dockerfile to use venv · 325db60e
  one authored Apr 01, 2026
  
  325db60e
- Add gpu-hpcg metrics · 2056d7fa
  one authored Apr 01, 2026
  
  2056d7fa
31 Mar, 2026 1 commit
- Update dtk docker image · 4f69c7de
  one authored Mar 31, 2026
  
  4f69c7de
27 Mar, 2026 1 commit
- MicroBenchmark: rocHPCG · e4c2bd4c
  one authored Mar 27, 2026
  
  e4c2bd4c
25 Mar, 2026 1 commit
- Improve DTK gemm-flops · 211e63c7
  one authored Mar 25, 2026
  
  211e63c7
20 Mar, 2026 1 commit
- Fix paths, deps, envs in dockerfile · df0bde6c
  one authored Mar 20, 2026
  
  df0bde6c
19 Mar, 2026 3 commits

Migrate gpu-stream to BabelStream v5.0 · d4051602
one authored Mar 19, 2026

d4051602

Enhance DTK platform support and GPU detection · 1a57f2d6

one authored Mar 19, 2026

- Added Platform.DTK in the microbenchmark framework.
- Introduced new DTK hipblaslt benchmark class and corresponding tests.
- Updated Dockerfile to include hipblaslt-bench and its permissions.
- Registered DTK benchmarks in the benchmark registry for various performance tests.
- Enhanced GPU detection logic to recognize HYGON GPUs.

This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.

1a57f2d6

Update DTK dockerfile and microbenchmarks · c4f39919

one authored Mar 19, 2026

- Update rocm_common.cmake for CMake>=3.24
- Prevent isolation build
- Add BabelStream as a submodule
- Update dockerignore

c4f39919

17 Mar, 2026 1 commit
- Add a dtk dockerfile · 0fdfe4c3
  one authored Mar 17, 2026
  
  0fdfe4c3
11 Mar, 2026 1 commit

Microbenchmark: upgrade Intel MLC to v3.12 in rocm5.0.x (#784) · 6b8e8104

Hongtao Zhang authored Mar 10, 2026



## Summary
- Upgrade Intel Memory Latency Checker from v3.11 to v3.12 in
rocm5.0.x.dockerfile
- Aligns with other dockerfiles that already use v3.12
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

6b8e8104

04 Feb, 2026 1 commit

Submodule Update: update gpu-burn to newest version (#761) · 575859be

WenqingLan1 authored Feb 03, 2026



Updated 3rd party submodule gpu-burn to newest version for
implementation & doc support for cuda13.0.
Co-authored-by: guoshzhao <guzhao@microsoft.com>

575859be

28 Jan, 2026 1 commit

CI/CD - Fix Image build for cuda11.1.1 (#771) · 8b805d90

Hongtao Zhang authored Jan 28, 2026



**Description**

- When building the CUDA 11.1.1 image, pip (Python 3.8) cannot find a
pre-built wheel for the latest wandb release (v0.23.1). As a result, pip
attempts to build wandb from source. The build then fails because the
image does not have Go installed, which is required to build wandb from
source.

**Solution**

- For the CUDA 11.1.1 build, install the required build tools (e.g., Go,
Rust, and Cargo) needed for wandb.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

8b805d90

21 Dec, 2025 1 commit

CI/CD - Fix Azure pipeline (#767) · c99380b4

Hongtao Zhang authored Dec 20, 2025



**Description**
Azure pipeline cpu-unit-test failed for "2025-12-10T03:47:59.0628597Z
ERROR: Could not install packages due to an OSError: [Errno 28] No space
left on device"

**Root Cause**
This happens because the matrix jobs (Python 3.7, 3.10, 3.12) run in
parallel and share the same VM's disk. Python 3.12 downloads
newer/larger packages (especially PyTorch and NVIDIA CUDA libraries
which are ~3GB+), and when multiple jobs run simultaneously, they
exhaust the disk space.

**Fix**
Disable the cache usage when installing SB
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

c99380b4

04 Dec, 2025 1 commit

Bug fix - update IB_DEVICES specification logic to fix ib-loopback test regression (#762) · e3fd943a

Henry Li authored Dec 03, 2025

**Description**

The ib-loopback test regressed due to this recent
[change](https://github.com/microsoft/superbenchmark/commit/c65ae56713d6bfcc4a3be37d7fe24779590f9791).
When running ib-loopback using the standard
[config](https://github.com/microsoft/superbenchmark/blob/c65ae56713d6bfcc4a3be37d7fe24779590f9791/superbench/config/default.yaml#L69),
the test would fail because it passed numeric values like `0` into the
test command, and `0` is not a valid IB device name.

Example failure:

```
[2025-11-25 22:08:38,100 vmssnc6ec000003:141056][micro_base.py:200][INFO] Execute command - round: 0, benchmark: ib-loopback, command: /usr/local/bin/run_perftest_loopback 47 45 /usr/local/bin/ib_write_bw -s 8388608 -F --iters=20000 -d 0 -p 45617 -x 0 --report_gbits.
[0]: IB device 0 not found
 Unable to find the Infiniband/RoCE device
IB device 0 not found
 Unable to find the Infiniband/RoCE device
[2025-11-25 22:08:39,113 vmssnc6ec000003:141056][micro_base.py:209][ERROR] Microbenchmark execution failed - round: 0, benchmark: ib-loopback, error message: IB device 0 not found
 Unable to find the Infiniband/RoCE device
IB device 0 not found
 Unable to find the Infiniband/RoCE device
```


---------
Co-authored-by: Henry Li <lihl@microsoft.com>

e3fd943a

17 Nov, 2025 1 commit

Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB... · c65ae567

Yuting Jiang authored Nov 17, 2025

Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)

**Description**
Add a --set_ib_devices option to auto-select the IB device by MPI local rank.


**Major Revision**
- Add a new CLI flag --set_ib_devices to automatically select irregular
IB devices based on the MPI local rank.
- When enabled, the benchmark queries available IB devices via
network.get_ib_devices() and selects the device corresponding to
OMPI_COMM_WORLD_LOCAL_RANK.
- Fall back to existing --ib_dev behavior when the flag is not provided.

**Minor Revision**
- Add an environment variable in network.get_ib_devices() to allow the
user to set the device name
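
The selection logic described above can be sketched roughly as follows. This is an illustrative sketch, not the code from the commit: `select_ib_device` and its parameters are hypothetical names, the `devices` list stands in for whatever network.get_ib_devices() returns, and the device names are made up.

```python
import os
from typing import List, Mapping


def select_ib_device(
    devices: List[str],
    set_ib_devices: bool,
    ib_dev: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick an IB device name for this process.

    devices        -- stand-in for the result of network.get_ib_devices()
    set_ib_devices -- stand-in for the --set_ib_devices flag
    ib_dev         -- stand-in for the existing --ib_dev value
    """
    if not set_ib_devices:
        # Flag not provided: keep the existing --ib_dev behavior.
        return ib_dev
    rank_text = env.get('OMPI_COMM_WORLD_LOCAL_RANK')
    if rank_text is None:
        raise RuntimeError('OMPI_COMM_WORLD_LOCAL_RANK is not set')
    rank = int(rank_text)
    if not 0 <= rank < len(devices):
        raise IndexError(f'local rank {rank} has no matching IB device')
    return devices[rank]


# Made-up device names; local rank 1 maps to the second device.
fake_env = {'OMPI_COMM_WORLD_LOCAL_RANK': '1'}
print(select_ib_device(['mlx5_0', 'mlx5_2'], True, 'mlx5_0', env=fake_env))
```

Indexing by local rank rather than assuming a name such as `mlx5_<rank>` is what lets the selection work when device numbering is irregular. How the commit handles a missing rank variable or an out-of-range rank is not stated, so the exceptions here are an assumption of the sketch.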

c65ae567

06 Nov, 2025 1 commit
- Fix pipelines - Update mlc version in dockerfiles from v3.11 to v3.12 (#752) · 25db1115
  WenqingLan1 authored Nov 06, 2025
```
Updated mlc wget link in dockerfiles.

---------
Co-authored-by: guoshzhao <guzhao@microsoft.com>
```
  25db1115
05 Nov, 2025 1 commit

CI/CD - Fix Azure test pipeline (#754) · 1b4377fc

Hongtao Zhang authored Nov 04, 2025

The Python 3.10 verification pipeline failed because of a conflicting
'setuptools' version.

Root Cause:
The problem is that modern pip (25.3) uses an isolated build environment
with the latest setuptools by default. The pipeline installs setuptools
65.7 in the user environment, but pip builds the package in an isolated
environment with newer setuptools, which conflicts with the version
check in setup.py.

Solution:
Remove pip upgrade.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>

1b4377fc