- 21 Apr, 2026 5 commits
-
-
Hongtao Zhang authored
Summary

The gpu_stream benchmark has NVIDIA-specific dependencies that prevent it from compiling on ROCm 6.3+. This change makes it CUDA-only and skips the build with a warning on non-NVIDIA environments.

Problem

The gpu_stream benchmark fails to compile on ROCm 6.3+ due to multiple NVIDIA-specific dependencies:

1. nvml.h: NVIDIA Management Library header, used for querying actual memory clock rates. No HIP equivalent. Referenced in gpu_stream.cu and gpu_stream_utils.hpp.
2. cuda.h in headers: three .hpp files (gpu_stream.hpp, gpu_stream_kernels.hpp, gpu_stream_utils.hpp) directly include <cuda.h> and <cuda_runtime.h>. These headers are not processed by hipify-perl (only .cu source files are), so they fail to resolve on ROCm.
3. Deprecated hipDeviceProp_t struct fields: the code accesses memoryBusWidth, memoryClockRate, and ECCEnabled from the device properties struct. These fields were removed from hipDeviceProp_t in ROCm 6.3, causing compilation errors after hipification.

The existing ROCm path was marked as incomplete (# TODO: test for ROC) and was never fully functional on recent ROCm versions.

Changes

- Removed the non-functional ROCm/HIP build path from gpu_stream/CMakeLists.txt
- When CUDA is not found, prints a warning and returns gracefully instead of attempting a broken hipify build or raising FATAL_ERROR
- No changes to the NVIDIA/CUDA build path; it continues to work as before

Impact

- NVIDIA builds: no change; gpu_stream builds and installs normally
- ROCm builds: gpu_stream is skipped with a warning message. Previously it would fail the entire make cppbuild step, blocking the Docker image build
- Other benchmarks: unaffected; build.sh continues to the next benchmark after gpu_stream returns

Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
-
one authored
* Update gpu-hpcg metrics to encode process and problem shape
* Fix tests
-
one authored
-
one authored
- Add BW150 config
- Update BW1000 config
- Merge summary rules
-
one authored
- Add `numactl` support for local runner modes, including `cpunodebind`, `membind`, and `physcpubind`.
- Add `gpu_affinity` resolution through `sb node topo --get gpu-numa-affinity --gpu-id`.
- Add `sb node topo` support for GPU NUMA topology queries.
- Update BW1000 config to use the new local `numactl` semantics.
- Document the new `numactl` mode fields and limitations.
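As a rough illustration of how the three binding fields could map onto a `numactl` invocation, here is a minimal sketch. The function name `build_numactl_prefix` and its signature are hypothetical and not taken from SuperBench; only the `--cpunodebind`, `--membind`, and `--physcpubind` options are real `numactl` options.

```python
def build_numactl_prefix(cpunodebind=None, membind=None, physcpubind=None):
    """Build a numactl argument list from optional binding settings.

    Hypothetical helper for illustration; not SuperBench's implementation.
    Returns an empty list when no binding is requested, so the caller can
    prepend the result to a command unconditionally.
    """
    args = []
    if cpunodebind is not None:
        args.append(f"--cpunodebind={cpunodebind}")
    if membind is not None:
        args.append(f"--membind={membind}")
    if physcpubind is not None:
        args.append(f"--physcpubind={physcpubind}")
    return ["numactl", *args] if args else []


# Example: bind CPU and memory to NUMA node 0 for an arbitrary command.
command = build_numactl_prefix(cpunodebind=0, membind=0) + ["echo", "hello"]
```

Returning an empty prefix when nothing is set keeps the unbound case identical to running the command directly.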
-
- 20 Apr, 2026 2 commits
- 18 Apr, 2026 5 commits
-
-
one authored
* Fix some lint warnings
* Exclude some paths in cpplint
* Fix some tests and formatting
-
one authored
-
one authored
Adds an opt-in deterministic training mode to SuperBench's PyTorch model benchmarks. When enabled with --enable-determinism, PyTorch deterministic algorithms are enforced, and per-step numerical fingerprints (loss, activation means) are recorded as metrics. These can be compared across runs using the existing sb result diagnosis pipeline to verify bit-exact reproducibility, which is useful for hardware validation and platform comparison.

Flags added
- --enable-determinism
- --check-frequency: number of steps after which the metrics are recorded
- --deterministic-seed

Changes
- Updated pytorch_base.py to handle deterministic settings and logging.
- Added a new example script: pytorch_deterministic_example.py
- Added a test file, test_pytorch_determinism_all.py, to verify everything works as expected.

Usage
- Step 1: Run 1. Run with --enable-determinism; the necessary metrics will be recorded in the results-summary.jsonl file.
- Step 2: Generate the baseline file from the Run 1 results using sb result generate-baseline.
- Step 3: Run 2. Run with --enable-determinism on a different machine (or the same machine); the necessary metrics will be recorded in the results-summary.jsonl file.
- Step 4: Run diagnosis on the results generated from the two runs using the sb result diagnosis command.

Note
1. Make sure all the parameters are constant between the two runs.
2. Running the diagnosis command requires the rules.yaml file.

---------

Co-authored-by: Aishwarya Tonpe <aishwarya.tonpe25@gmail.com>
Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
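The idea of comparing per-step fingerprints for bit-exact equality can be sketched in plain Python. This is only an illustration of the comparison itself, not the `sb result diagnosis` pipeline; the function name, the data layout (step number mapped to metric values), and the metric names `loss` and `act_mean` are all assumptions made for the example.

```python
def compare_fingerprints(baseline, candidate):
    """Return (step, metric, baseline_value, candidate_value) for every mismatch.

    Both arguments map a step number to a dict of metric name -> float.
    Values must be exactly equal; no tolerance is applied, since the goal
    is bit-exact reproducibility. A metric missing on one side is also
    reported, with None in place of the missing value.
    """
    mismatches = []
    for step in sorted(set(baseline) | set(candidate)):
        base_metrics = baseline.get(step, {})
        cand_metrics = candidate.get(step, {})
        for name in sorted(set(base_metrics) | set(cand_metrics)):
            b = base_metrics.get(name)
            c = cand_metrics.get(name)
            if b is None or c is None or b != c:
                mismatches.append((step, name, b, c))
    return mismatches


# Hypothetical fingerprints recorded every 10 steps in two runs.
run_1 = {0: {"loss": 2.5, "act_mean": 0.125}, 10: {"loss": 1.75, "act_mean": 0.0625}}
run_2 = {0: {"loss": 2.5, "act_mean": 0.125}, 10: {"loss": 1.7500001, "act_mean": 0.0625}}
diffs = compare_fingerprints(run_1, run_2)
```

An empty result means every recorded fingerprint matched exactly; any entry pinpoints the first step and metric where the two runs diverged.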
-
one authored
-
one authored
-
- 17 Apr, 2026 3 commits
- 15 Apr, 2026 1 commit
-
-
one authored
-
- 02 Apr, 2026 5 commits
- 01 Apr, 2026 5 commits
- 27 Mar, 2026 1 commit
-
-
one authored
-
- 25 Mar, 2026 2 commits
-
-
Aishwarya Tonpe authored
Adds an opt-in deterministic training mode to SuperBench's PyTorch model benchmarks. When enabled with --enable-determinism, PyTorch deterministic algorithms are enforced, and per-step numerical fingerprints (loss, activation means) are recorded as metrics. These can be compared across runs using the existing sb result diagnosis pipeline to verify bit-exact reproducibility, which is useful for hardware validation and platform comparison.

Flags added
- --enable-determinism
- --check-frequency: number of steps after which the metrics are recorded
- --deterministic-seed

Changes
- Updated pytorch_base.py to handle deterministic settings and logging.
- Added a new example script: pytorch_deterministic_example.py
- Added a test file, test_pytorch_determinism_all.py, to verify everything works as expected.

Usage
- Step 1: Run 1. Run with --enable-determinism; the necessary metrics will be recorded in the results-summary.jsonl file.
- Step 2: Generate the baseline file from the Run 1 results using sb result generate-baseline.
- Step 3: Run 2. Run with --enable-determinism on a different machine (or the same machine); the necessary metrics will be recorded in the results-summary.jsonl file.
- Step 4: Run diagnosis on the results generated from the two runs using the sb result diagnosis command.

Note
1. Make sure all the parameters are constant between the two runs.
2. Running the diagnosis command requires the rules.yaml file.

---------

Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
-
one authored
-
- 19 Mar, 2026 3 commits
-
-
one authored
-
one authored
- Added Platform.DTK in the microbenchmark framework.
- Introduced new DTK hipblaslt benchmark class and corresponding tests.
- Updated Dockerfile to include hipblaslt-bench and its permissions.
- Registered DTK benchmarks in the benchmark registry for various performance tests.
- Enhanced GPU detection logic to recognize HYGON GPUs.

This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.
-
one authored
- Update rocm_commom.cmake for CMake>=3.24
- Prevent isolation build
- Add BabelStream as a submodule
- Update dockerignore
-
- 28 Jan, 2026 1 commit
-
-
Hongtao Zhang authored
**Description**
- When building the CUDA 11.1.1 image, pip (Python 3.8) cannot find a pre-built wheel for the latest wandb release (v0.23.1), so it attempts to build wandb from source. The build then fails because the image does not have Go installed, which is required for building wandb from source.

**Solution**
- For the CUDA 11.1.1 build, install the required build tools (e.g., Go, Rust, and Cargo) needed for wandb.

---------

Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
-
- 04 Dec, 2025 1 commit
-
-
Henry Li authored
**Description**
The ib-loopback test regressed due to this recent [change](https://github.com/microsoft/superbenchmark/commit/c65ae56713d6bfcc4a3be37d7fe24779590f9791). When running ib-loopback with the standard [config](https://github.com/microsoft/superbenchmark/blob/c65ae56713d6bfcc4a3be37d7fe24779590f9791/superbench/config/default.yaml#L69), the test would fail because numeric values like `0` were passed into the test command, and `0` is not a valid IB device name.

Example failure:
```
[2025-11-25 22:08:38,100 vmssnc6ec000003:141056][micro_base.py:200][INFO] Execute command - round: 0, benchmark: ib-loopback, command: /usr/local/bin/run_perftest_loopback 47 45 /usr/local/bin/ib_write_bw -s 8388608 -F --iters=20000 -d 0 -p 45617 -x 0 --report_gbits.
[0]: IB device 0 not found
Unable to find the Infiniband/RoCE device
IB device 0 not found
Unable to find the Infiniband/RoCE device
[2025-11-25 22:08:39,113 vmssnc6ec000003:141056][micro_base.py:209][ERROR] Microbenchmark execution failed - round: 0, benchmark: ib-loopback, error message: IB device 0 not found
Unable to find the Infiniband/RoCE device
IB device 0 not found
Unable to find the Infiniband/RoCE device
```

---------

Co-authored-by: Henry Li <lihl@microsoft.com>
-
- 17 Nov, 2025 1 commit
-
-
Yuting Jiang authored
Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)

**Description**
Add --set_ib_devices option to auto-select IB device by MPI local rank.

**Major Revision**
- Add a new CLI flag --set_ib_devices to automatically select irregular IB devices based on the MPI local rank.
- When enabled, the benchmark queries available IB devices via network.get_ib_devices() and selects the device corresponding to OMPI_COMM_WORLD_LOCAL_RANK.
- Fall back to existing --ib_dev behavior when the flag is not provided.

**Minor Revision**
- Add an environment variable in network.get_ib_devices() to allow the user to set the device name.
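The rank-based selection described above can be sketched as follows. This is a minimal illustration rather than the benchmark's code: `select_ib_device` is a hypothetical helper, the device list is passed in directly instead of being queried through `network.get_ib_devices()`, the device names are made up, and defaulting to rank 0 when the variable is unset is an assumption.

```python
import os


def select_ib_device(ib_devices, env=None):
    """Pick the IB device whose index equals the MPI local rank.

    Hypothetical helper for illustration. `ib_devices` is an ordered list of
    device names; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    # Assumption: treat a missing variable as local rank 0.
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    if not 0 <= local_rank < len(ib_devices):
        raise ValueError(
            f"local rank {local_rank} has no matching IB device "
            f"({len(ib_devices)} available)"
        )
    return ib_devices[local_rank]


# Illustrative device names; real names depend on the host.
devices = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
chosen = select_ib_device(devices, {"OMPI_COMM_WORLD_LOCAL_RANK": "2"})
```

Raising on an out-of-range rank makes a mismatch between process count and device count visible immediately instead of silently reusing a device.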
-
- 23 Oct, 2025 1 commit
-
-
Yuting Jiang authored
**Description**
This PR adds NCU (NVIDIA Nsight Compute) profiling support to the cublaslt-gemm micro benchmark, enabling detailed kernel analysis including DRAM throughput, compute throughput, and launch arguments.

**Major Revision**
- Add --enable_ncu_profiling and --profiling_metrics for ncu profiling
- Modifies command execution to use NCU when profiling is enabled
- Updates result parsing to handle both standard and NCU profiled output formats
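Conditionally wrapping a benchmark command with `ncu` might look roughly like the sketch below. The helper name `wrap_with_ncu`, the placeholder binary `./gemm_benchmark`, and the metric names in the example are assumptions for illustration; `--metrics` is a real `ncu` option that takes a comma-separated list, and `ncu --query-metrics` can be used to check which metric names a given GPU supports.

```python
def wrap_with_ncu(command, enabled=False, metrics=None):
    """Return the command unchanged, or prefixed with an ncu invocation.

    Hypothetical helper for illustration; not the benchmark's implementation.
    `command` is a list of arguments; `metrics` is an optional list of
    ncu metric names.
    """
    if not enabled:
        return list(command)
    ncu_cmd = ["ncu"]
    if metrics:
        ncu_cmd += ["--metrics", ",".join(metrics)]
    return ncu_cmd + list(command)


# Placeholder binary and example metric names (verify with `ncu --query-metrics`).
profiled = wrap_with_ncu(
    ["./gemm_benchmark"],
    enabled=True,
    metrics=[
        "dram__throughput.avg.pct_of_peak_sustained_elapsed",
        "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    ],
)
```

Because the profiled and unprofiled runs print different output, the caller still needs separate parsing for each case, as the revision notes describe.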
-
- 22 Oct, 2025 1 commit
-
-
Ziyue Yang authored
Benchmarks: Micro benchmark - Support verification and parallel run for disk performance benchmark (#741)

**Description**
Adds verification and parallel run support for disk performance benchmark.

**Major Revision**
- Adds `--verify` flag to verify written data.
- Supports loading benchmark options from the `PROC_RANK`, `BLOCK_DEVICES` and `NUMA_NODES` environment variables.

---------

Co-authored-by: guoshzhao <guzhao@microsoft.com>
-
- 08 Oct, 2025 1 commit
-
-
Hongtao Zhang authored
To improve benchmark debugging, the following debug methods were added:

pytorch profiler in model benchmark
- SB_ENABLE_PYTORCH_PROFILER: switch to enable/disable
- SB_TORCH_PROFILER_TRACE_DIR: log path

These 2 runtime variables need to be configured in SB config file.

nsys in SB runner
- SB_ENABLE_NSYS: switch to enable/disable
- SB_NSYS_TRACE_DIR: log path

These 2 runtime variables need to be configured in runner's ENV.

---------

Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
-
- 01 Oct, 2025 1 commit
-
-
WenqingLan1 authored
Add support for cuda13.0.
- Add cuda13.0.dockerfile.
- Add cuda13.0 image building task to github pipeline.
- Update GPU STREAM to work with cuda13.0.
- Fix data type conversion perf bug in GPU stream.
- Update nvbandwidth submodule to be v0.8.
- Update perftest submodule to be 4bee61f80d9e268fc97eaf40be00409e91d3a19e (recent master).

---------

Co-authored-by: Ubuntu <dilipreddi@gmail.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>
-
- 29 Sep, 2025 1 commit
-
-
Yuting Jiang authored
**Description**
Add an option to exclude data copy time in model benchmarks.

**Major Revision**
- Add an option --no_copy
- Move the start time to after the data copy finishes
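The effect of moving the start time can be illustrated with a small CPU-only sketch. The function `timed_step` and its arguments are hypothetical and stand in for the model benchmark's step loop; only the ordering of the timer start relative to the copy reflects the change described above.

```python
import time


def timed_step(copy_fn, compute_fn, no_copy=False):
    """Run one step and return (result, elapsed_seconds).

    Hypothetical helper for illustration. With no_copy=True the timer starts
    only after copy_fn has returned, so copy time is excluded.
    """
    if no_copy:
        data = copy_fn()
        start = time.perf_counter()
    else:
        start = time.perf_counter()
        data = copy_fn()
    result = compute_fn(data)
    return result, time.perf_counter() - start


def slow_copy():
    # Stand-in for a host-to-device copy that takes about 50 ms.
    time.sleep(0.05)
    return [1, 2, 3]


total_with_copy, elapsed_with_copy = timed_step(slow_copy, sum)
total_no_copy, elapsed_no_copy = timed_step(slow_copy, sum, no_copy=True)
```

On a GPU the copy may be asynchronous, so a real benchmark would likely also need to synchronize the device before reading the clock; that detail is omitted here.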
-