- 18 Apr, 2026 7 commits
one authored
- Fix some lint warnings
- Exclude some paths in cpplint
- Fix some tests and formatting
one authored
Adds an opt-in deterministic training mode to SuperBench's PyTorch model benchmarks. When --enable-determinism is set, PyTorch deterministic algorithms are enforced, and per-step numerical fingerprints (loss, activation means) are recorded as metrics. These can be compared across runs using the existing sb result diagnosis pipeline to verify bit-exact reproducibility, which is useful for hardware validation and platform comparison.

Flags added:
- --enable-determinism
- --check-frequency: number of steps after which the metrics are recorded
- --deterministic-seed

Changes:
- Updated pytorch_base.py to handle deterministic settings and logging.
- Added a new example script: pytorch_deterministic_example.py
- Added a test file, test_pytorch_determinism_all.py, to verify everything works as expected.

Usage:
- Step 1: Run 1 - run with --enable-determinism; the necessary metrics will be recorded in the results-summary.jsonl file.
- Step 2: Generate the baseline file from the Run 1 results using sb result generate-baseline.
- Step 3: Run 2 - run with --enable-determinism on a different machine (or the same machine); the metrics will again be recorded in results-summary.jsonl.
- Step 4: Run diagnosis on the results generated from the two runs using the sb result diagnosis command.

Note:
1. Make sure all the parameters are constant between the two runs.
2. Running the diagnosis command requires the rules.yaml file.

---------

Co-authored-by:
Aishwarya Tonpe <aishwarya.tonpe25@gmail.com> Co-authored-by:
Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
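The cross-run comparison described above can be sketched in plain Python. This is a hedged illustration only: the metric names (fingerprint_loss), the jsonl layout, and the helper functions are assumptions for the example, not SuperBench's actual schema or the sb result diagnosis implementation.

```python
import json


def load_fingerprints(jsonl_text):
    """Collect per-step fingerprint metrics (e.g. loss, activation means)
    from results-summary.jsonl-style lines into {metric: [values]}."""
    metrics = {}
    for line in jsonl_text.strip().splitlines():
        record = json.loads(line)
        for name, value in record.items():
            if name.startswith('fingerprint'):
                metrics.setdefault(name, []).append(value)
    return metrics


def bit_exact(run_a, run_b):
    """True only if every fingerprint metric matches exactly in both runs."""
    return load_fingerprints(run_a) == load_fingerprints(run_b)


# Two deterministic runs produce identical fingerprints; a third diverges.
run1 = '{"step": 0, "fingerprint_loss": 2.3025851}\n{"step": 1, "fingerprint_loss": 2.2990412}'
run2 = '{"step": 0, "fingerprint_loss": 2.3025851}\n{"step": 1, "fingerprint_loss": 2.2990412}'
run3 = '{"step": 0, "fingerprint_loss": 2.3025851}\n{"step": 1, "fingerprint_loss": 2.2990471}'

print(bit_exact(run1, run2))  # -> True
print(bit_exact(run1, run3))  # -> False, a single diverging step is caught
```

Because the comparison is exact equality rather than a tolerance check, even a 1-ulp drift in any recorded step fails the check, which is the point of verifying bit-exact reproducibility.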
- 17 Apr, 2026 4 commits
- 15 Apr, 2026 1 commit
- 02 Apr, 2026 9 commits
- 01 Apr, 2026 7 commits
- 31 Mar, 2026 1 commit
- 27 Mar, 2026 1 commit
- 25 Mar, 2026 1 commit
- 20 Mar, 2026 1 commit
- 19 Mar, 2026 3 commits
one authored
- Added Platform.DTK in the microbenchmark framework.
- Introduced a new DTK hipblaslt benchmark class and corresponding tests.
- Updated the Dockerfile to include hipblaslt-bench and its permissions.
- Registered DTK benchmarks in the benchmark registry for various performance tests.
- Enhanced GPU detection logic to recognize HYGON GPUs.

This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.
one authored
- Update rocm_commom.cmake for CMake >= 3.24
- Prevent isolation build
- Add BabelStream as a submodule
- Update dockerignore
- 17 Mar, 2026 1 commit
- 11 Mar, 2026 1 commit
Hongtao Zhang authored
## Summary
- Upgrade Intel Memory Latency Checker from v3.11 to v3.12 in rocm5.0.x.dockerfile
- Aligns with other dockerfiles that already use v3.12

Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com> Co-authored-by:
Claude Opus 4.5 <noreply@anthropic.com>
- 04 Feb, 2026 1 commit
WenqingLan1 authored
Updated the third-party submodule gpu-burn to the latest version for implementation and documentation support for CUDA 13.0.
Co-authored-by: guoshzhao <guzhao@microsoft.com>
- 28 Jan, 2026 1 commit
Hongtao Zhang authored
**Description**
When building the CUDA 11.1.1 image, pip (Python 3.8) cannot find a pre-built wheel for the latest wandb release (v0.23.1), so it attempts to build wandb from source. That build fails because the image does not have Go installed, which is required to build wandb from source, and the error appears.

**Solution**
For the CUDA 11.1.1 build, install the build tools (e.g., Go, Rust, and Cargo) required by wandb.

---------

Co-authored-by:
Hongtao Zhang <hongtaozhang@microsoft.com> Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com>
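A pre-flight check along these lines can surface the failure mode before pip attempts the source build. This is a hedged sketch: missing_build_tools is a hypothetical helper, not part of the actual fix, which installs the tools in the Dockerfile.

```python
import shutil


def missing_build_tools(required=('go', 'cargo'), which=shutil.which):
    """Return the build tools from `required` that are not on PATH.

    wandb's source build needs the Go toolchain (and some releases also
    need Rust/Cargo), so an empty result means a source build can proceed.
    `which` is injectable so the check can be tested without touching PATH.
    """
    return [tool for tool in required if which(tool) is None]


# Simulate a bare CUDA 11.1.1 image where no build tools are installed.
print(missing_build_tools(which=lambda tool: None))  # -> ['go', 'cargo']
```

On a real system, calling missing_build_tools() with the defaults consults the actual PATH via shutil.which.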
- 21 Dec, 2025 1 commit
Hongtao Zhang authored
**Description**
The Azure pipeline cpu-unit-test failed with: "2025-12-10T03:47:59.0628597Z ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device".

**Root Cause**
The matrix jobs (Python 3.7, 3.10, 3.12) run in parallel and share the same VM's disk. Python 3.12 downloads newer and larger packages (especially PyTorch and the NVIDIA CUDA libraries, which are ~3 GB+), and when multiple jobs run simultaneously, they exhaust the disk space.

**Fix**
Disable the pip cache when installing SB.

Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>