Commits · 47d4a79d5868a7173fc580a55e16c8486a6ce32f · tsoc / superbenchmark

18 Apr, 2026 1 commit

Benchmark: Model benchmark - deterministic training support (#731) (#2) · 47d4a79d

one authored Apr 18, 2026



Adds an opt-in deterministic training mode to SuperBench's PyTorch model
benchmarks. When enabled with --enable-determinism, PyTorch deterministic
algorithms are enforced, and per-step numerical fingerprints (loss,
activation means) are recorded as metrics. These can be compared across
runs using the existing sb result diagnosis pipeline to verify bit-exact
reproducibility, which is useful for hardware validation and platform
comparison.
 
Flags added - 

--enable-determinism
--check-frequency: interval, in steps, at which the metrics are recorded
--deterministic-seed
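
As a rough illustration of how the three flags interact, here is a minimal sketch. It is not the code in pytorch_base.py: only the flag names come from this commit message, while the defaults, the `build_parser` and `should_record` helpers, and the 1-based step counting are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser; only the flag names come from the commit message."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--enable-determinism', action='store_true')
    # Assumed default: record a fingerprint every 100 steps.
    parser.add_argument('--check-frequency', type=int, default=100)
    # Assumed default seed.
    parser.add_argument('--deterministic-seed', type=int, default=42)
    return parser


def should_record(step, args):
    """Return True when fingerprints would be recorded at this 1-based step."""
    return args.enable_determinism and step % args.check_frequency == 0


args = build_parser().parse_args(['--enable-determinism', '--check-frequency', '50'])
recorded_steps = [step for step in range(1, 201) if should_record(step, args)]
print(recorded_steps)  # [50, 100, 150, 200]
```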

Changes - 

Updated pytorch_base.py to handle deterministic settings, logging.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything
works as expected.

Usage - 

Step 1: Run 1 - Run with --enable-determinism; the necessary metrics
are recorded in the results-summary.jsonl file
Step 2: Generate the baseline file from the Run 1 results using
sb result generate-baseline
Step 3: Run 2 - Run with --enable-determinism again, on a different
machine (or the same machine); the necessary metrics are recorded in
that run's results-summary.jsonl file
Step 4: Run diagnosis on the results generated from the two runs using
the sb result diagnosis command

Note - 
1. Make sure all the parameters are constant between the 2 runs 
2. Running the diagnosis command requires the rules.yaml file
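
What "bit-exact" means for the recorded fingerprints can be sketched in plain Python. The metric names and values below are made up, and this is not how sb result diagnosis is implemented; it only illustrates comparing the raw IEEE 754 bits of each value instead of using a tolerance.

```python
import struct


def float_bits(value):
    """Return the 64-bit IEEE 754 pattern of a float as an integer."""
    return struct.unpack('<Q', struct.pack('<d', value))[0]


def mismatched_metrics(run_a, run_b):
    """List metrics that differ in any bit, or that exist in only one run."""
    names = sorted(set(run_a) | set(run_b))
    return [
        name for name in names
        if name not in run_a or name not in run_b
        or float_bits(run_a[name]) != float_bits(run_b[name])
    ]


# Hypothetical per-step fingerprints from three runs.
run_1 = {'loss_step_100': 1.0, 'act_mean_step_100': 0.25}
run_2 = {'loss_step_100': 1.0, 'act_mean_step_100': 0.25}
# Differs from run_1 only in the last bit of the loss.
run_3 = {'loss_step_100': 1.0000000000000002, 'act_mean_step_100': 0.25}

print(mismatched_metrics(run_1, run_2))  # []
print(mismatched_metrics(run_1, run_3))  # ['loss_step_100']
```

A tolerance-based check such as `abs(a - b) < 1e-9` would accept run_3, which is why a reproducibility check of this kind compares the values exactly.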

---------
Co-authored-by: Aishwarya Tonpe <aishwarya.tonpe25@gmail.com>
Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>

47d4a79d

01 Apr, 2026 1 commit

Add metric sorters for RCCL tests and rocHPCG · 05e137be

one authored Apr 01, 2026

05e137be

28 Jan, 2026 1 commit

CI/CD - Fix Image build for cuda11.1.1 (#771) · 8b805d90

Hongtao Zhang authored Jan 28, 2026



**Description**

- When building the CUDA 11.1.1 image, pip (Python 3.8) cannot find a
pre-built wheel for the latest wandb release (v0.23.1), so it attempts
to build wandb from source. The build fails because the image does not
have Go installed, which is required for building wandb from source.

**Solution**

- For the CUDA 11.1.1 build, install the required build tools (e.g., Go,
Rust, and Cargo) needed for wandb.

---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

8b805d90

04 Mar, 2025 1 commit

Analyzer - Enhance logging information for diagnosis rule op baseline errors. (#689) · 64edc9c5

Jorge Esguerra authored Mar 04, 2025

Improves logging info for diagnosis rule op baseline errors. This allows
developers to easily detect errors in their rule files as well as
baseline files, improving end-user experience.

64edc9c5

16 Aug, 2024 1 commit

Bug Fix: Data Diagnosis - Fix bug of failure test and warning of pandas in data diagnosis (#638) · 7af75df3

Yuting Jiang authored Aug 16, 2024

**Description**
Fix the failure check bug and the pandas warnings in data diagnosis.

**Major Revision**
- fix pandas warnings in replace and fillna caused by type downcasting
- fix bug where the failure check function checked only one matched
metric rather than all matched metrics
- fix bug when converting a metric regex into a str when there is more
than one match group

7af75df3

22 Nov, 2023 1 commit

Analyzer - Generate baseline given results from multiple nodes. (#575) · 9f4880cb

guoshzhao authored Nov 22, 2023



**Description**
Generate baseline given results from multiple nodes. 

**Major Revision**
- Add sub command `sb result generate-baseline`
- Add UT and docs

---------
Co-authored-by: 454314380 <454314380@qq.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

9f4880cb

14 Apr, 2023 1 commit

Release - SuperBench v0.8.0 (#517) · 51761b3a

Yifan Xiong authored Apr 14, 2023



**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in python3.8 due to pandas api change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when write host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)
Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

51761b3a

03 Jan, 2023 1 commit

Benchmarks: Micro benchmarks - add source code of correctness check for cublas functions (#450) · 678b1251

Yuting Jiang authored Jan 03, 2023

**Description**
Add C source code for the correctness check of cuBLAS functions.

**Major Revision**
- add correctness check for all supported cublas functions
- add --correctness option into binary

**Minor Revision**
- fix bug and make fill_data and prepare_tensor templates so the output matrix has the right memory alignment for each datatype

678b1251

06 Sep, 2022 1 commit

Release - SuperBench v0.6.0 (#409) · 63e9b2d1

Yifan Xiong authored Sep 06, 2022



**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

63e9b2d1

23 Aug, 2022 1 commit

Analyzer - Add support to store values of metrics in data diagnosis (#392) · 733860d7

Yuting Jiang authored Aug 23, 2022

**Description**
Add support to store values of metrics in data diagnosis.

Take the following rules as an example:
```
    nccl_store_rule:
      categories: NCCL_DIS
      store: True
      metrics:
        - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
        - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
        - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
        - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
        - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
    nccl_rule:
      function: multi_rules
      criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False'
      categories: NCCL_DIS
```
**nccl_store_rule** will store the values of the metrics in a dict and save them into `label["nccl_store_rule"]`, and then **nccl_rule** can use the values of the metrics through `label["nccl_store_rule"].values()` in its criteria.
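
To make the criteria concrete, here is a minimal sketch that evaluates the lambda from nccl_rule against a hand-built `label` dict. The bandwidth values and the `make_label` helper are hypothetical, and using `eval` on the criteria string is an assumption about how such a rule could be evaluated, not a description of SuperBench internals.

```python
# Criteria string copied from the nccl_rule example above.
criteria = ('lambda label:True if min(label["nccl_store_rule"].values())'
            '/max(label["nccl_store_rule"].values())<0.95 else False')
# eval is used here only to mirror the string form of the rule.
check = eval(criteria)


def make_label(values):
    """Build the dict a store rule would fill in (hypothetical values)."""
    metrics = [
        'nccl-bw:allreduce-run{}/allreduce_1073741824_busbw'.format(i)
        for i in range(len(values))
    ]
    return {'nccl_store_rule': dict(zip(metrics, values))}


stable = make_label([230.0, 231.5, 229.8, 230.4, 231.0])
unstable = make_label([230.0, 231.5, 210.0, 230.4, 231.0])

print(check(stable))    # False: min/max is about 0.993
print(check(unstable))  # True: min/max is about 0.907, below 0.95
```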

733860d7

22 Aug, 2022 1 commit

Analyzer - Add support for both jsonl and json format in data diagnosis (#388) · 10a79c4e

Yuting Jiang authored Aug 22, 2022

**Description**
Add support for both jsonl and json format in data diagnosis.

**Major Revision**
- Add support for both jsonl and json format in data diagnosis


**Minor Revision**
- change related doc
- add jsonl support in cli
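
A loader that accepts both formats might look like the sketch below. Dispatching on the file extension, the `load_results` name, and the record layout are assumptions for illustration rather than the actual file handler.

```python
import json
import os
import tempfile


def load_results(path):
    """Load a list of per-node records from a .json or .jsonl file."""
    with open(path, encoding='utf-8') as f:
        if path.endswith('.jsonl'):
            # One JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
        data = json.load(f)
        # Assume a .json file holds a list of records or a single record.
        return data if isinstance(data, list) else [data]


# Hypothetical records written in both formats, then read back.
records = [{'node': 'node0', 'metric_a': 1.0}, {'node': 'node1', 'metric_a': 2.0}]
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, 'results-summary.jsonl')
    json_path = os.path.join(tmp, 'results-summary.json')
    with open(jsonl_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(json.dumps(r) for r in records) + '\n')
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(records, f)
    from_jsonl = load_results(jsonl_path)
    from_json = load_results(json_path)

print(from_jsonl == from_json == records)  # True
```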

10a79c4e

09 Aug, 2022 1 commit

Analyzer: Rename fields in json of data diagnosis to be more readable (#382) · b5c7c85d

Yuting Jiang authored Aug 09, 2022

**Description**
Rename field in data diagnosis to be more readable.

**Major Revision**
- rename fields according to diagnosis/metric format

**Minor Revision**
- change type of diagnosis/issue_num to be int

b5c7c85d

01 Aug, 2022 1 commit

Analyzer - Add failure check feature in data diagnosis (#378) · ec16d425

Yuting Jiang authored Aug 01, 2022

**Description**
Add failure check feature in data diagnosis.

**Major Revision**
- Add failure check rule op: if a metric_regex is not matched by any metric in the result, label the result as failedtest
- Split performance issue and failedtest in categories
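
The idea behind the failure check can be sketched as follows. The `failure_check` helper, the regex patterns, and the sample result are hypothetical; only the rule "a metric_regex matched by no metric means failedtest" comes from the description above.

```python
import re


def failure_check(metric_regexes, result_metrics):
    """Return the regexes that no metric name in the result matches."""
    return [
        pattern for pattern in metric_regexes
        if not any(re.fullmatch(pattern, name) for name in result_metrics)
    ]


# Hypothetical rule patterns and one node's result.
rule_regexes = [r'nccl-bw/allreduce_\d+_busbw', r'gemm-flops/fp16_flops']
node_result = {'nccl-bw/allreduce_8_busbw': 1.2, 'nccl-bw/allreduce_16_busbw': 2.3}

unmatched = failure_check(rule_regexes, node_result)
label = 'failedtest' if unmatched else None
print(label, unmatched)  # failedtest ['gemm-flops/fp16_flops']
```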


**Minor Revision**
- replace DataFrame.append() with pd.concat since append() will be removed in later version of pandas

ec16d425

01 Jun, 2022 1 commit

Analyzer - Fix bugs in data diagnosis (#355) · 54da021b

user4543 authored Jun 01, 2022

**Description**
Fix bugs in data diagnosis.

**Major Revision**
- add support for getting the baseline of a metric that uses custom benchmark naming with ':', like 'nccl-bw:default/allreduce_8_bw:0'
- save raw data of all metrics rather than only the metrics defined in diagnosis_rules.yaml when output_all is True
- fix bug of using the wrong column index when applying formatting (red color and percentile) in the Excel output

54da021b

29 Apr, 2022 1 commit

Release - SuperBench v0.5.0 (#350) · 6681c720

Yifan Xiong authored Apr 29, 2022



**Description**

Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

6681c720

10 Apr, 2022 1 commit

Analyzer: Add Feature - Output results of all nodes in data diagnosis (#336) · 55b0f9d2

user4543 authored Apr 10, 2022

**Description**
Output results of all nodes in data diagnosis.

55b0f9d2

24 Mar, 2022 1 commit

Analyzer: Add feature - Add result summary in excel,md,html format (#320) · 84fed1ce

user4543 authored Mar 24, 2022

**Description**
Add result summary in excel,md,html format.

**Major Revision**
- Add ResultSummary class to support result summary in excel,md,html format.
- Abstract RuleBase class for commonly used functions in DataDiagnosis and ResultSummary.

84fed1ce

15 Mar, 2022 1 commit

Analyzer - Add md and html output format for DataDiagnosis (#325) · b3c95f18

user4543 authored Mar 15, 2022

**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis

b3c95f18

07 Mar, 2022 1 commit

Analyzer: Revise - Abstract RuleBase from DataDiagnosis (#321) · 1ec055e1

user4543 authored Mar 07, 2022

**Description**
Abstract RuleBase from DataDiagnosis.

1ec055e1

20 Feb, 2022 1 commit

Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289) · 97ed12f9

user4543 authored Feb 20, 2022

**Description**
Add multi-rules feature for data diagnosis to support multiple rules' combined check.

**Major Revision**
- revise rule design to support multiple rules combination check
- update related codes and tests

97ed12f9

30 Dec, 2021 1 commit

Release - SuperBench v0.4.0 (#278) · ff563b66

Yifan Xiong authored Dec 30, 2021



__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

ff563b66

10 Dec, 2021 1 commit

Analyzer: Add Feature - Add basic analysis features (#248) · c2f942cb

Yuting Jiang authored Dec 10, 2021

**Description**
Add basic analysis features.

**Major Revision**
- Add statistics, correlations of the raw data
- Add numeric outlier detection(inter_quartile_range)
- Add boxplot for selected metric
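
For reference, interquartile-range outlier detection can be sketched with the standard library alone. The fence factor `k=1.5` is the conventional choice and an assumption here, as are the `iqr_outliers` name and the sample values; the analyzer's own implementation may differ. `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical bandwidth readings from seven nodes.
bandwidth = [230.0, 231.0, 229.5, 230.5, 231.5, 180.0, 230.2]
print(iqr_outliers(bandwidth))  # [180.0]
```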

c2f942cb

08 Dec, 2021 1 commit

Analyzer: Initialization - Add baseline-based data diagnosis module (#242) · c13ed2a2

Yuting Jiang authored Dec 08, 2021

**Description**
Add data diagnosis module.

**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes
- Add RuleOp class to define rule operators
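
A baseline-based rule can be pictured as comparing each metric's relative deviation from its baseline against a criterion. The sketch below is an illustration under assumptions: the `variance` and `diagnose` helpers, the 5% threshold, and the sample metrics are hypothetical and are not taken from the DataDiagnosis or RuleOp classes.

```python
def variance(value, baseline):
    """Relative deviation of a metric from its baseline."""
    return (value - baseline) / baseline


def diagnose(node_metrics, baseline, criteria):
    """Return the metrics whose deviation from baseline meets the criteria."""
    return sorted(
        name for name, value in node_metrics.items()
        if name in baseline and criteria(variance(value, baseline[name]))
    )


# Hypothetical baseline and per-node results.
baseline = {'kernel-launch/event_time': 0.0050, 'mem-bw/h2d_bw': 25.0}
results = {
    'node0': {'kernel-launch/event_time': 0.0051, 'mem-bw/h2d_bw': 24.8},
    'node1': {'kernel-launch/event_time': 0.0049, 'mem-bw/h2d_bw': 20.0},
}

# Flag a metric when it drops more than 5% below the baseline.
violations = {
    node: diagnose(metrics, baseline, lambda x: x < -0.05)
    for node, metrics in results.items()
}
print(violations)  # {'node0': [], 'node1': ['mem-bw/h2d_bw']}
```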

c13ed2a2