Commits · 036c4712b1256219e26fd9b5740bba8d6d6596c7 · tsoc / superbenchmark

25 Mar, 2026 1 commit

Benchmark: Model benchmark - deterministic training support (#731) · 036c4712

Aishwarya Tonpe authored Mar 25, 2026

Adds an opt-in deterministic training mode to SuperBench's PyTorch model
benchmarks. When enabled with --enable-determinism, PyTorch deterministic
algorithms are enforced, and per-step numerical fingerprints (loss,
activation means) are recorded as metrics. These can be compared across
runs using the existing sb result diagnosis pipeline to verify bit-exact
reproducibility, which is useful for hardware validation and platform
comparison.
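
As a rough illustration of what enforcing deterministic algorithms usually involves in PyTorch, here is a minimal sketch. It is not the code added to pytorch_base.py; the function name `enable_determinism` and the choice of `:4096:8` for the cuBLAS workspace are assumptions made for this example. PyTorch is imported optionally so the sketch still runs where it is not installed.

```python
import os
import random

try:
    import torch  # optional here so the sketch still runs without PyTorch
except ImportError:
    torch = None


def enable_determinism(seed: int) -> None:
    """Seed the RNGs and ask PyTorch to use deterministic kernels only (hypothetical helper)."""
    # cuBLAS needs a fixed workspace configuration for deterministic results;
    # this has to be set before the first CUDA call.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    if torch is None:
        return
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # autotuning can pick different kernels per run
```

Calling `enable_determinism(seed)` once, before the model and data loaders are created, is the usual placement for this kind of setup.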

Flags added -

--enable-determinism
--check-frequency: Number of steps after which the metrics are
recorded
--deterministic-seed
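
To make the shape of these flags concrete, the following is a standalone argparse sketch. Only the three flag names come from this commit; the types, the default values (100 and 42), and the help texts are assumptions, and SuperBench registers benchmark arguments through its own base classes rather than a standalone parser like this one.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three determinism flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Determinism flags (illustrative sketch)")
    parser.add_argument(
        "--enable-determinism",
        action="store_true",
        help="Enforce deterministic algorithms and record fingerprints.",
    )
    parser.add_argument(
        "--check-frequency",
        type=int,
        default=100,  # assumed default, not taken from the commit
        help="Number of steps after which the metrics are recorded.",
    )
    parser.add_argument(
        "--deterministic-seed",
        type=int,
        default=42,  # assumed default, not taken from the commit
        help="Seed used when determinism is enabled.",
    )
    return parser


args = build_parser().parse_args(["--enable-determinism", "--check-frequency", "50"])
print(args.enable_determinism, args.check_frequency, args.deterministic_seed)
```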

Changes -

Updated pytorch_base.py to handle deterministic settings and logging.
Added a new example script: pytorch_deterministic_example.py
Added a test file, test_pytorch_determinism_all.py, to verify that
everything works as expected.

Usage -

Step 1: Run 1 - Run with --enable-determinism; the necessary metrics
are recorded in the results-summary.jsonl file
Step 2: Generate the baseline file from the Run 1 results using sb
result generate-baseline
Step 3: Run 2 - Run with --enable-determinism on a different machine
(or the same machine); the necessary metrics are recorded in the
results-summary.jsonl file
Step 4: Run diagnosis on the results from the 2 runs using the sb
result diagnosis command

Note -
1. Make sure all parameters are identical between the 2 runs
2. Running the diagnosis command requires the rules.yaml file
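
The idea behind the comparison in the steps above can be sketched independently of the sb result commands. The snippet below is not the diagnosis pipeline; it only shows what a bit-exact check means for two sets of recorded values. The metric names (`loss_step_0`, `loss_step_100`) and the flat one-object-per-line layout are assumptions for illustration, and all values are assumed to be numeric.

```python
import json
import struct


def float_bits(value: float) -> bytes:
    """Return the IEEE-754 double encoding, so equality means bit-exact equality."""
    return struct.pack("<d", value)


def load_metrics(jsonl_text: str) -> dict:
    """Merge the metric dictionaries found in JSON Lines text (one object per line)."""
    merged = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            merged.update(json.loads(line))
    return merged


def diff_fingerprints(baseline: dict, candidate: dict) -> list:
    """List metric names that are missing on one side or differ in any bit."""
    mismatches = []
    for name in sorted(set(baseline) | set(candidate)):
        if name not in baseline or name not in candidate:
            mismatches.append(name)
        elif float_bits(float(baseline[name])) != float_bits(float(candidate[name])):
            mismatches.append(name)
    return mismatches


# Hypothetical fingerprints from two runs; the second loss differs in the last bit.
run1 = '{"loss_step_0": 2.3025850929940455, "loss_step_100": 1.75}'
run2 = '{"loss_step_0": 2.3025850929940455, "loss_step_100": 1.7500000000000002}'
print(diff_fingerprints(load_metrics(run1), load_metrics(run2)))
```

Comparing encoded bytes rather than using a tolerance is deliberate: a deterministic run is expected to reproduce every recorded value exactly, so any difference at all is a finding.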

---------
Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>

036c4712

05 Jul, 2023 1 commit
- CI/CD - Support DirectX test pipeline (#545) · 3704a432
  Yuting Jiang authored Jul 05, 2023
```
**Description**
Support DirectX test pipeline.
```
  3704a432
04 Jan, 2023 1 commit

Runner - Generate host groups file in mpi mode (#458) · 8e748d56

Yang Wang authored Jan 04, 2023

**Major Revision**

- Add a pattern option that generates the mpi_pattern.txt file when a
path is specified.
- In mpi pattern mode, serial_index and parallel_index are added to each
benchmark as environment variables.

**Minor Revision**
- Fix typo
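
A benchmark could pick up these indices from its environment roughly as follows. This is a sketch only: the variable names `SB_MODE_SERIAL_INDEX` and `SB_MODE_PARALLEL_INDEX` and the fallback value of 0 are assumptions for illustration, since the commit message does not spell out the exact names.

```python
import os


def read_pattern_indices(env=os.environ) -> tuple:
    """Return (serial_index, parallel_index) from the environment, defaulting to 0."""
    # Variable names are assumed for this sketch, not taken from the commit.
    serial = int(env.get("SB_MODE_SERIAL_INDEX", "0"))
    parallel = int(env.get("SB_MODE_PARALLEL_INDEX", "0"))
    return serial, parallel


print(read_pattern_indices({"SB_MODE_SERIAL_INDEX": "2", "SB_MODE_PARALLEL_INDEX": "1"}))
```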

8e748d56

03 Jan, 2023 1 commit
- Runner: Support `topo-aware` and `k-batch` pattern in 'mpi' mode (#437) · 65e433c0
  Yang Wang authored Jan 03, 2023
```
**Description**
Support the following patterns in `mpi` mode:
* `k-batch`
* `topo-aware`
```
  65e433c0
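
One plausible reading of `k-batch` is that the host list is cut into consecutive groups of k hosts that run concurrently. The sketch below implements that reading only; the function name, the decision to drop leftover hosts that do not fill a group, and the host names are assumptions, not a description of the runner's code. `topo-aware` is not sketched here because it depends on fabric topology information.

```python
def k_batch_groups(hosts: list, k: int) -> list:
    """Split hosts into consecutive groups of exactly k; leftover hosts are dropped."""
    if k <= 0 or k > len(hosts):
        return []
    return [hosts[i:i + k] for i in range(0, len(hosts) - k + 1, k)]


# Hypothetical host names for illustration.
print(k_batch_groups(["node0", "node1", "node2", "node3", "node4"], 2))
```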
30 Dec, 2022 1 commit

Executor - Add stdout logging util module and enable real-time logging flushing in executor (#445) · 9dfefce3

Yuting Jiang authored Dec 30, 2022

**Description**
Add stdout logging util module and enable real-time logging flushing in executor

**Major Revision**
- Add stdout logging util module to redirect stdout into file log
- Enable stdout logging in executor to write benchmark output into both stdout and file `sb-bench.log`
- Enable real-time log flushing in run_command of microbenchmarks through config `log_flushing`

**Minor Revision**
- Add log_n_step args to enable regular step time logging in model benchmarks
- Update related docs
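
The general technique of mirroring stdout into a log file with immediate flushing can be sketched as below. This is not the util module added in this commit; the class name `StdoutTee` and its context-manager interface are assumptions chosen for a compact example.

```python
import sys


class StdoutTee:
    """File-like object that forwards writes to the real stdout and to a log file."""

    def __init__(self, log_path: str):
        self._stdout = sys.stdout
        self._log = open(log_path, "a", encoding="utf-8")

    def write(self, text: str) -> int:
        self._stdout.write(text)
        self._log.write(text)
        self.flush()  # flush on every write so output shows up in real time
        return len(text)

    def flush(self) -> None:
        self._stdout.flush()
        self._log.flush()

    def __enter__(self):
        sys.stdout = self  # route print() through the tee
        return self

    def __exit__(self, *exc) -> bool:
        sys.stdout = self._stdout  # restore the original stream
        self._log.close()
        return False
```

Wrapping a benchmark call in `with StdoutTee("sb-bench.log"):` would send every `print` both to the terminal and to the file; flushing on each write trades some throughput for output that is visible while the benchmark is still running.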

9dfefce3

29 Dec, 2022 1 commit
- Runner - Support `pair-wise` pattern in `mpi` mode (#447) · 7838b6b1
  Yang Wang authored Dec 29, 2022
```
* Extract pair-wise pattern from ib_validation
```
  7838b6b1
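
A pair-wise schedule, in which every pair of hosts is tested once and each host appears in at most one pair per round, can be generated with the classic round-robin circle method. The sketch below shows that method under the assumption that this is what "pair-wise" means here; it is not claimed to match the code extracted from ib_validation, and the function name is hypothetical.

```python
def pair_wise_rounds(n: int) -> list:
    """Return rounds of disjoint index pairs covering every pair once (circle method)."""
    if n < 2:
        return []
    ids = list(range(n))
    if n % 2:
        ids.append(None)  # dummy entry: whoever is paired with it sits the round out
    size = len(ids)
    rounds = []
    for _ in range(size - 1):
        pairs = [(ids[i], ids[size - 1 - i]) for i in range(size // 2)]
        rounds.append([pair for pair in pairs if None not in pair])
        # Keep the first entry fixed and rotate the rest by one position.
        ids = [ids[0]] + [ids[-1]] + ids[1:-1]
    return rounds


for round_pairs in pair_wise_rounds(4):
    print(round_pairs)
```

With an even number of hosts this yields n - 1 rounds of n / 2 concurrent pairs; with an odd number, one host idles in each round.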
29 Nov, 2022 1 commit

Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430) · e4eeda0a

Yang Wang authored Nov 29, 2022

* add mpi-parallels mode

* update according to comments

* fix and update doc

* update

* merge into 'mpi' mode

* update according to comments

* fix testcases

* fix ansible

* regard pattern as field

* update

* fix flake8 version

* add flake8 range

* remove map-by from host config

* update comments

e4eeda0a

05 Jul, 2022 1 commit
- CLI - Support SKU auto detect if running on Azure VM (#365) · a94ead34
  Yifan Xiong authored Jul 05, 2022
```
Support SKU auto-detection and use the corresponding benchmark config when running on an Azure VM.
```
  a94ead34
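
One way to detect the SKU on an Azure VM is to query the Azure Instance Metadata Service (IMDS) for the VM size and map it to a config file. The sketch below assumes that approach; the commit message does not say how detection is implemented. The mapping table, the config file names, and both function names are hypothetical.

```python
import urllib.request

# IMDS is only reachable from inside an Azure VM and requires the Metadata header.
IMDS_URL = (
    "http://169.254.169.254/metadata/instance/compute/vmSize"
    "?api-version=2021-02-01&format=text"
)

# Hypothetical mapping from lower-cased VM size to a benchmark config file.
SKU_CONFIGS = {
    "standard_nd96asr_v4": "azure_ndv4.yaml",
}


def fetch_vm_size(timeout: float = 2.0):
    """Return the VM size reported by IMDS, or None when not running on Azure."""
    request = urllib.request.Request(IMDS_URL, headers={"Metadata": "true"})
    # IMDS must be queried directly, so bypass any configured proxies.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except OSError:  # covers URLError, timeouts, and unreachable-network errors
        return None


def pick_config(vm_size, default: str = "default.yaml") -> str:
    """Choose the config for a VM size, falling back to the default config."""
    if not vm_size:
        return default
    return SKU_CONFIGS.get(vm_size.lower(), default)
```

Separating the network call from the lookup keeps the mapping testable offline: `pick_config(fetch_vm_size())` falls back to the default config on any machine where IMDS does not answer.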
15 Nov, 2021 1 commit

Benchmarks: Add Feature - Extend the device manager utility to support more functions. (#239) · cc70f9c1

guoshzhao authored Nov 15, 2021

**Description**
Rename the `nvidia_helper` utility to the `device_manager` module and support more functions:
```
device_manager.get_device_count()
device_manager.get_device_utilization(idx)
device_manager.get_device_temperature(idx)
device_manager.get_device_power_limit(idx)
device_manager.get_device_memory(idx)
device_manager.get_device_row_remapped_info(idx)
device_manager.get_device_ecc_error(idx)
```

cc70f9c1