1. 23 Apr, 2026 1 commit
    • Benchmarks: Add gpu-hpl and gpu-hpl-mxp micro benchmarks (#15) · 4fa10f4d
      one authored
      Add gpu-hpl and gpu-hpl-mxp micro benchmarks backed by rocHPL and rocHPL-MxP.
      
      Implemented a shared GPU HPL base that:
      - Generates per-workload HPL dat files and parses the corresponding output files.
      - Supports common HPL inputs such as process grid, matrix size, block size, broadcast topology, warmup, iterations, and reduce operator.
      - Adds rocHPL-specific tuning parameters for gpu-hpl.
      - Formats metric keys from input-derived workload attributes.
      - Reports `flops`, `time`, and `tests_pass` metrics with warmup-aware aggregation.
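      The per-workload dat-file generation described above can be sketched roughly as follows. This is a minimal illustration of writing a standard HPL input file from workload attributes, not the commit's actual implementation; `write_hpl_dat` and its argument names are hypothetical.

```python
# Minimal sketch of generating an HPL input file from workload inputs.
# write_hpl_dat is a hypothetical helper, not the commit's actual code.

def write_hpl_dat(path, n, nb, p, q, bcast=0):
    """Write a minimal HPL input file for one workload.

    n: matrix size, nb: block size, (p, q): process grid,
    bcast: broadcast topology id.
    """
    lines = [
        'HPLinpack benchmark input file',
        'Innovative Computing Laboratory, University of Tennessee',
        'HPL.out      output file name (if any)',
        '6            device out (6=stdout,7=stderr,file)',
        '1            # of problems sizes (N)',
        f'{n}         Ns',
        '1            # of NBs',
        f'{nb}        NBs',
        '0            PMAP process mapping (0=Row-,1=Column-major)',
        '1            # of process grids (P x Q)',
        f'{p}         Ps',
        f'{q}         Qs',
        '16.0         threshold',
        '1            # of BCASTs',
        f'{bcast}     BCASTs',
    ]
    with open(path, 'w') as f:
        f.write('\n'.join(lines) + '\n')

# One dat file per workload, e.g. a 2x4 grid with illustrative sizes:
write_hpl_dat('HPL.dat', n=43008, nb=384, p=2, q=4)
```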
      
      Add benchmark registrations, parser tests, sample output fixtures, documentation, and recommended configurations for gpu-hpl and gpu-hpl-mxp.
      
      Update rocHPL and rocHPL-MxP third-party integration with build patches, install targets, and SuperBench run helper scripts.
      
      Also update gpu-hpcg metric naming to use flops instead of gflops, remove standalone domain/verification-style metrics from the documented metric surface, and refresh Hygon HPCG documentation/config references accordingly.
  2. 21 Apr, 2026 5 commits
    • Bugfix - gpu_stream: remove ROCm build support, require CUDA with NVML (#789) · 3c95714f
      Hongtao Zhang authored
      
      
      Summary
      
      The gpu_stream benchmark has NVIDIA-specific dependencies that prevent
      it from compiling on ROCm 6.3+. This change makes it CUDA-only,
      gracefully skipping the build with a warning on non-NVIDIA environments.
      
      Problem
      
      The gpu_stream benchmark fails to compile on ROCm 6.3+ due to multiple
      NVIDIA-specific dependencies:
      
      1. nvml.h — NVIDIA Management Library header, used for querying actual
      memory clock rates. No HIP equivalent. Referenced in gpu_stream.cu and
      gpu_stream_utils.hpp.
      2. cuda.h in headers — Three .hpp files (gpu_stream.hpp,
      gpu_stream_kernels.hpp, gpu_stream_utils.hpp) directly include <cuda.h>
      and <cuda_runtime.h>. These headers are not processed by hipify-perl
      (only .cu source files are), so they fail to resolve on ROCm.
      3. Deprecated hipDeviceProp_t struct fields — The code accesses
      memoryBusWidth, memoryClockRate, and ECCEnabled from the device
      properties struct. These fields were removed from hipDeviceProp_t in
      ROCm 6.3, causing compilation errors after hipification.
      
      The existing ROCm path was marked as incomplete (# TODO: test for ROC)
      and was never fully functional on recent ROCm versions.
      
      Changes
      
      - Removed the non-functional ROCm/HIP build path from
      gpu_stream/CMakeLists.txt
      - When CUDA is not found, prints a warning and returns gracefully
      instead of attempting a broken hipify build or raising FATAL_ERROR
      - No changes to the NVIDIA/CUDA build path — it continues to work as
      before
      
      Impact
      
      - NVIDIA builds: No change — gpu_stream builds and installs normally
      - ROCm builds: gpu_stream is skipped with a warning message. Previously
      it would fail the entire make cppbuild step, blocking the Docker image
      build
      - Other benchmarks: Unaffected — build.sh continues to the next
      benchmark after gpu_stream returns
      Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
    • Benchmarks: Update gpu-hpcg metrics to encode process and problem shape (#8) · 0a1a15ea
      one authored
      * Update gpu-hpcg metrics to encode process and problem shape
      * Fix tests
    • SysInfo: Simplify smi commands · d7a56e0b
      one authored
    • Config: Update config files (#7) · 511807b7
      one authored
      - Add BW150 config
      - Update BW1000 config
      - Merge summary rules
    • Runner: Add local numactl GPU affinity support (#6) · 0993db75
      one authored
      - Add `numactl` support for local runner modes, including `cpunodebind`, `membind`, and `physcpubind`.
      - Add `gpu_affinity` resolution through `sb node topo --get gpu-numa-affinity --gpu-id`.
      - Add `sb node topo` support for GPU NUMA topology queries.
      - Update BW1000 config to use the new local `numactl` semantics.
      - Document the new `numactl` mode fields and limitations.
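      The affinity-to-prefix step above can be sketched as follows: once `sb node topo` resolves a GPU's NUMA node, the runner prepends a matching `numactl` invocation to the benchmark command. `build_numactl_prefix` is a hypothetical helper for illustration, not the runner's actual code.

```python
# Illustrative sketch: build a numactl prefix from a resolved GPU-NUMA
# affinity. build_numactl_prefix is a hypothetical helper, not the
# runner's actual implementation.

def build_numactl_prefix(numa_node=None, physcpubind=None):
    """Compose a numactl command prefix from affinity settings."""
    args = ['numactl']
    if numa_node is not None:
        # Bind both CPU scheduling and memory allocation to the node.
        args += [f'--cpunodebind={numa_node}', f'--membind={numa_node}']
    if physcpubind is not None:
        args.append(f'--physcpubind={physcpubind}')
    # No affinity settings means no prefix at all.
    return args if len(args) > 1 else []

# e.g. GPU 0 resolved to NUMA node 1 via `sb node topo`:
prefix = build_numactl_prefix(numa_node=1)
```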
  3. 20 Apr, 2026 2 commits
  4. 18 Apr, 2026 5 commits
    • Fix some lint warnings (#3) · b31acf90
      one authored
      * Fix some lint warnings
      * Exclude some paths in cpplint
      * Fix some tests and formatting
    • Format python code on branch dtk · 2bf01d5e
      one authored
    • Benchmark: Model benchmark - deterministic training support (#731) (#2) · 47d4a79d
      one authored
      
      
      Adds an opt-in deterministic training mode to SuperBench's PyTorch model
      benchmarks. When enabled via --enable-determinism, PyTorch deterministic
      algorithms are enforced, and per-step numerical fingerprints (loss,
      activation means) are recorded as metrics. These can be compared across
      runs using the existing sb result diagnosis pipeline to verify bit-exact
      reproducibility — useful for hardware validation and platform
      comparison.
       
      Flags added:

      --enable-determinism
      --check-frequency: the interval, in steps, at which the metrics are recorded
      --deterministic-seed
      
      Changes:

      - Updated pytorch_base.py to handle deterministic settings and logging.
      - Added a new example script: pytorch_deterministic_example.py
      - Added a test file: test_pytorch_determinism_all.py to verify everything
        works as expected.
      
      Usage:

      Step 1: Run with --enable-determinism; the metrics are recorded in the
      results-summary.jsonl file.
      Step 2: Generate the baseline file from the Run 1 results using sb
      result generate-baseline.
      Step 3: Run with --enable-determinism again on a different machine (or
      the same machine); the metrics are recorded in its results-summary.jsonl
      file.
      Step 4: Run diagnosis on the results from the two runs using the sb
      result diagnosis command.
      
      Note:
      1. Make sure all the parameters are constant between the two runs.
      2. Running the diagnosis command requires the rules.yaml file.
      
      ---------
      Co-authored-by: Aishwarya Tonpe <aishwarya.tonpe25@gmail.com>
      Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
    • Format python code · 8c28b69a
      one authored
  5. 17 Apr, 2026 3 commits
  6. 15 Apr, 2026 1 commit
  7. 02 Apr, 2026 5 commits
  8. 01 Apr, 2026 5 commits
  9. 27 Mar, 2026 1 commit
  10. 25 Mar, 2026 2 commits
    • Benchmark: Model benchmark - deterministic training support (#731) · 036c4712
      Aishwarya Tonpe authored
      
      
      Adds an opt-in deterministic training mode to SuperBench's PyTorch model
      benchmarks. When enabled via --enable-determinism, PyTorch deterministic
      algorithms are enforced, and per-step numerical fingerprints (loss,
      activation means) are recorded as metrics. These can be compared across
      runs using the existing sb result diagnosis pipeline to verify bit-exact
      reproducibility — useful for hardware validation and platform
      comparison.
       
      Flags added:

      --enable-determinism
      --check-frequency: the interval, in steps, at which the metrics are recorded
      --deterministic-seed
      
      Changes:

      - Updated pytorch_base.py to handle deterministic settings and logging.
      - Added a new example script: pytorch_deterministic_example.py
      - Added a test file: test_pytorch_determinism_all.py to verify everything
        works as expected.
      
      Usage:

      Step 1: Run with --enable-determinism; the metrics are recorded in the
      results-summary.jsonl file.
      Step 2: Generate the baseline file from the Run 1 results using sb
      result generate-baseline.
      Step 3: Run with --enable-determinism again on a different machine (or
      the same machine); the metrics are recorded in its results-summary.jsonl
      file.
      Step 4: Run diagnosis on the results from the two runs using the sb
      result diagnosis command.
      
      Note:
      1. Make sure all the parameters are constant between the two runs.
      2. Running the diagnosis command requires the rules.yaml file.
      
      ---------
      Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
    • Improve DTK gemm-flops · 211e63c7
      one authored
  11. 19 Mar, 2026 3 commits
    • Migrate gpu-stream to BabelStream v5.0 · d4051602
      one authored
    • Enhance DTK platform support and GPU detection · 1a57f2d6
      one authored
      - Added Platform.DTK in the microbenchmark framework.
      - Introduced new DTK hipblaslt benchmark class and corresponding tests.
      - Updated Dockerfile to include hipblaslt-bench and its permissions.
      - Registered DTK benchmarks in the benchmark registry for various performance tests.
      - Enhanced GPU detection logic to recognize HYGON GPUs.
      
      This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.
    • Update DTK dockerfile and microbenchmarks · c4f39919
      one authored
      - Update rocm_commom.cmake for CMake>=3.24
      - Prevent isolation build
      - Add BabelStream as a submodule
      - Update dockerignore
  12. 28 Jan, 2026 1 commit
  13. 04 Dec, 2025 1 commit
  14. 17 Nov, 2025 1 commit
    • Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB... · c65ae567
      Yuting Jiang authored
      Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)
      
      **Description**
      add --set_ib_devices option to auto-select IB device by MPI local rank 
      
      
      **Major Revision**
      - Add a new CLI flag --set_ib_devices to automatically select the IB
      device based on the MPI local rank.
      - When enabled, the benchmark queries available IB devices via
      network.get_ib_devices() and selects the device corresponding to
      OMPI_COMM_WORLD_LOCAL_RANK.
      - Fall back to existing --ib_dev behavior when the flag is not provided.
      
      **Minor Revision**
      - Add an env in network.get_ib_devices() to allow user to set the device
      name
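      The rank-to-device mapping described above can be sketched as follows. This is an illustrative sketch, not the benchmark's actual code: the device list stands in for what network.get_ib_devices() would return, and the modulo fallback for ranks beyond the device count is an assumption.

```python
# Sketch of picking an IB device from the MPI local rank, as the
# --set_ib_devices option is described to do. The devices list stands
# in for network.get_ib_devices(); this is not the real implementation.
import os

def select_ib_device(devices, local_rank=None):
    """Map an MPI local rank onto the list of available IB devices."""
    if local_rank is None:
        # Open MPI exports the local rank in this environment variable.
        local_rank = int(os.environ.get('OMPI_COMM_WORLD_LOCAL_RANK', 0))
    if not devices:
        raise RuntimeError('no IB devices found')
    # Wrap around if there are more local ranks than devices (assumption).
    return devices[local_rank % len(devices)]
```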
  15. 23 Oct, 2025 1 commit
    • Benchmarks: Micro benchmark - add ncu profile support in cublaslt-gemm (#740) · f6e65a98
      Yuting Jiang authored
      **Description**
      This PR adds NCU (NVIDIA Nsight Compute) profiling support to the
      cublaslt-gemm micro benchmark, enabling detailed kernel analysis
      including DRAM throughput, compute throughput, and launch arguments.
      
      **Major Revision**
      - Add --enable_ncu_profiling and --profiling_metrics for ncu profiling
      - Modify command execution to use NCU when profiling is enabled
      - Update result parsing to handle both standard and NCU profiled output
      formats
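      The command-wrapping step above can be sketched as follows. This is an illustrative helper, not the benchmark's actual code; the flag names mirror the PR description, and the specific ncu options shown are an assumption about a typical invocation.

```python
# Sketch of prefixing a benchmark command with ncu when profiling is
# enabled. The Python flag names mirror the PR description; the helper
# and the ncu options shown are illustrative assumptions.

def wrap_with_ncu(command, enable_ncu_profiling=False, profiling_metrics=()):
    """Prefix `command` (a list of argv tokens) with an ncu invocation."""
    if not enable_ncu_profiling:
        return list(command)
    ncu = ['ncu', '--target-processes', 'all']
    if profiling_metrics:
        # e.g. dram__throughput.avg, sm__throughput.avg
        ncu += ['--metrics', ','.join(profiling_metrics)]
    return ncu + list(command)
```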
  16. 22 Oct, 2025 1 commit
  17. 08 Oct, 2025 1 commit
    • Enhancement: Add nsys and pytorch profiler debug trace support (#744) · d804dbb6
      Hongtao Zhang authored
      
      
      To improve benchmark debugging, the following debug methods were added:
      
      pytorch profiler in model benchmark
      
      - SB_ENABLE_PYTORCH_PROFILER: switch to enable/disable
      - SB_TORCH_PROFILER_TRACE_DIR: log path
      These two runtime variables need to be configured in the SB config file.
      
      nsys in SB runner
      
      - SB_ENABLE_NSYS: switch to enable/disable
      - SB_NSYS_TRACE_DIR: log path
      These two runtime variables need to be configured in the runner's ENV.
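      How a runner might consult these switches can be sketched as follows. The environment-variable names come from this commit; the wrapper logic, truthy-value handling, and nsys options are illustrative assumptions, not the runner's actual implementation.

```python
# Sketch of a runner consulting the nsys switches before launching a
# benchmark. The env-var names come from the commit; the wrapper logic
# and nsys options are illustrative assumptions.
import os

def maybe_wrap_with_nsys(command, env=os.environ):
    """Prefix `command` with an nsys invocation when SB_ENABLE_NSYS is set."""
    if env.get('SB_ENABLE_NSYS', '').lower() not in ('1', 'true', 'on'):
        return list(command)
    trace_dir = env.get('SB_NSYS_TRACE_DIR', '.')
    # Write the trace report under the configured directory.
    return ['nsys', 'profile', '-o', f'{trace_dir}/sb_trace'] + list(command)
```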
      
      ---------
      Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
  18. 01 Oct, 2025 1 commit