1. 24 Apr, 2026 2 commits
    • one's avatar
      Benchmark: Update overlap and sharding matmul benchmarks (#19) · a961ebd4
      one authored
      - Enable `computation-communication-overlap` and `sharding-matmul` in
      some configs through the existing PyTorch distributed mode.
      - Use `torchrun --standalone` for single-node `torch.distributed` runs
      to avoid fixed rendezvous port conflicts on 29500.
      - Update runner command-generation test expectation for the new
      single-node torchrun behavior.
      a961ebd4
    • one's avatar
      Benchmark: Update ort-inference for ROCm platform (#18) · c77bfe36
      one authored
      * Support rocm in ort-inference
      
      * Add tests
      
      * Update dockerfiles for docker 18
      
      * Install onnx, add params to ort-inference
      
      * Update docs
      c77bfe36
  2. 23 Apr, 2026 1 commit
    • one's avatar
      Benchmarks: Add gpu-hpl and gpu-hpl-mxp micro benchmarks (#15) · 4fa10f4d
      one authored
      Add gpu-hpl and gpu-hpl-mxp micro benchmarks backed by rocHPL and rocHPL-MxP.
      
      Implemented a shared GPU HPL base that:
      - Generates per-workload HPL dat files and parses the corresponding output files.
      - Supports common HPL inputs such as process grid, matrix size, block size, broadcast topology, warmup, iterations, and reduce operator.
      - Adds rocHPL-specific tuning parameters for gpu-hpl.
      - Formats metric keys from input-derived workload attributes.
      - Reports `flops`, `time`, and `tests_pass` metrics with warmup-aware aggregation.
      
      Add benchmark registrations, parser tests, sample output fixtures, documentation, and recommended configurations for gpu-hpl and gpu-hpl-mxp.
      
      Update rocHPL and rocHPL-MxP third-party integration with build patches, install targets, and SuperBench run helper scripts.
      
      Also update gpu-hpcg metric naming to use flops instead of gflops, remove standalone domain/verification-style metrics from the documented metric surface, and refresh Hygon HPCG documentation/config references accordingly.
      4fa10f4d
  3. 21 Apr, 2026 3 commits
    • Hongtao Zhang's avatar
      Bugfix - gpu_stream: remove ROCm build support, require CUDA with NVML (#789) · 3c95714f
      Hongtao Zhang authored
      
      
      Summary
      
      The gpu_stream benchmark has NVIDIA-specific dependencies that prevent
      it from compiling on ROCm 6.3+. This change makes it CUDA-only,
      gracefully skipping the build with a warning on non-NVIDIA
        environments.
      
        Problem
      
      The gpu_stream benchmark fails to compile on ROCm 6.3+ due to multiple
      NVIDIA-specific dependencies:
      
      1. nvml.h — NVIDIA Management Library header, used for querying actual
      memory clock rates. No HIP equivalent. Referenced in gpu_stream.cu and
      gpu_stream_utils.hpp.
      2. cuda.h in headers — Three .hpp files (gpu_stream.hpp,
      gpu_stream_kernels.hpp, gpu_stream_utils.hpp) directly include <cuda.h>
      and <cuda_runtime.h>. These headers are not processed by hipify-perl
      (only
        .cu source files are), so they fail to resolve on ROCm.
      3. Deprecated hipDeviceProp_t struct fields — The code accesses
      memoryBusWidth, memoryClockRate, and ECCEnabled from the device
      properties struct. These fields were removed from hipDeviceProp_t in
      ROCm
          6.3, causing compilation errors after hipification.
      
      The existing ROCm path was marked as incomplete (# TODO: test for ROC)
      and was never fully functional on recent ROCm versions.
      
        Changes
      
      - Removed the non-functional ROCm/HIP build path from
      gpu_stream/CMakeLists.txt
      - When CUDA is not found, prints a warning and returns gracefully
      instead of attempting a broken hipify build or raising FATAL_ERROR
      - No changes to the NVIDIA/CUDA build path — it continues to work as
      before
      
        Impact
      
         - NVIDIA builds: No change — gpu_stream builds and installs normally
      - ROCm builds: gpu_stream is skipped with a warning message. Previously
      it would fail the entire make cppbuild step, blocking the Docker image
      build
      - Other benchmarks: Unaffected — build.sh continues to the next
      benchmark after gpu_stream returns
      Co-authored-by: default avatarHongtao Zhang <hongtaozhang@microsoft.com>
      3c95714f
    • one's avatar
      Benchmarks: Update gpu-hpcg metrics to encode process and problem shape (#8) · 0a1a15ea
      one authored
      * Update gpu-hpcg metrics to encode process and problem shape
      
      * Fix tests
      0a1a15ea
    • one's avatar
      Runner: Add local numactl GPU affinity support (#6) · 0993db75
      one authored
      - Add `numactl` support for local runner modes, including `cpunodebind`, `membind`, and `physcpubind`.
      - Add `gpu_affinity` resolution through `sb node topo --get gpu-numa-affinity --gpu-id`.
      - Add `sb node topo` support for GPU NUMA topology queries.
      - Update BW1000 config to use the new local `numactl` semantics.
      - Document the new `numactl` mode fields and limitations.
      0993db75
  4. 20 Apr, 2026 1 commit
  5. 18 Apr, 2026 4 commits
    • one's avatar
      Fix some lint warnings (#3) · b31acf90
      one authored
      * Fix some lint warnings
      * Exclude some paths in cpplint
      * Fix some tests and formatting
      b31acf90
    • one's avatar
      Format python code on branch dtk · 2bf01d5e
      one authored
      2bf01d5e
    • one's avatar
      Benchmark: Model benchmark - deterministic training support (#731) (#2) · 47d4a79d
      one authored
      
      
      Adds opt-in deterministic training mode to SuperBench's PyTorch model
      benchmarks. When enabled --enable-determinism. PyTorch deterministic
      algorithms are enforced, and per-step numerical fingerprints (loss,
      activation means) are recorded as metrics. These can be compared across
      runs using the existing sb result diagnosis pipeline to verify bit-exact
      reproducibility — useful for hardware validation and platform
      comparison.
       
      Flags added - 
      
      --enable-determinism
      --check-frequency: Number of steps after which you want the metrics to
      be recorded
      --deterministic-seed
      
      Changes - 
      
      Updated pytorch_base.py to handle deterministic settings, logging.
      Added a new example script: pytorch_deterministic_example.py
      Added a test file: test_pytorch_determinism_all.py to verify everything
      works as expected.
      
      Usage - 
      
      Step 1: Run 1 - Run with --enable-determinism and the necessary metrics
      will be recorded in the results-summary.jsonl file
      Step 2: Generate the baseline file from the Run 1 results using - sb
      result generate-baseline
      Step 3: Run 2 - Run with --enable-determinism and the necessary metrics
      will be recorded in the results-summary.jsonl file on a different
      machine (or the same machine)
      Step 4: Run diagnosis on the results generated from the 2 runs using the
      - sb result diagnosis command
      
      Note - 
      1. Make sure all the parameters are constant between the 2 runs 
      2. Running the diagnosis command requires the rules.yaml file
      
      ---------
      Co-authored-by: default avatarAishwarya Tonpe <aishwarya.tonpe25@gmail.com>
      Co-authored-by: default avatarUbuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
      47d4a79d
    • one's avatar
  6. 17 Apr, 2026 1 commit
  7. 15 Apr, 2026 1 commit
  8. 02 Apr, 2026 3 commits
  9. 01 Apr, 2026 3 commits
  10. 25 Mar, 2026 1 commit
    • Aishwarya Tonpe's avatar
      Benchmark: Model benchmark - deterministic training support (#731) · 036c4712
      Aishwarya Tonpe authored
      
      
      Adds opt-in deterministic training mode to SuperBench's PyTorch model
      benchmarks. When enabled --enable-determinism. PyTorch deterministic
      algorithms are enforced, and per-step numerical fingerprints (loss,
      activation means) are recorded as metrics. These can be compared across
      runs using the existing sb result diagnosis pipeline to verify bit-exact
      reproducibility — useful for hardware validation and platform
      comparison.
       
      Flags added - 
      
      --enable-determinism
      --check-frequency: Number of steps after which you want the metrics to
      be recorded
      --deterministic-seed
      
      Changes - 
      
      Updated pytorch_base.py to handle deterministic settings, logging.
      Added a new example script: pytorch_deterministic_example.py
      Added a test file: test_pytorch_determinism_all.py to verify everything
      works as expected.
      
      Usage - 
      
      Step 1: Run 1 - Run with --enable-determinism and the necessary metrics
      will be recorded in the results-summary.jsonl file
      Step 2: Generate the baseline file from the Run 1 results using - sb
      result generate-baseline
      Step 3: Run 2 - Run with --enable-determinism and the necessary metrics
      will be recorded in the results-summary.jsonl file on a different
      machine (or the same machine)
      Step 4: Run diagnosis on the results generated from the 2 runs using the
      - sb result diagnosis command
      
      Note - 
      1. Make sure all the parameters are constant between the 2 runs 
      2. Running the diagnosis command requires the rules.yaml file
      
      ---------
      Co-authored-by: default avatarUbuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
      036c4712
  11. 19 Mar, 2026 2 commits
    • one's avatar
      Migrate gpu-stream to BabelStream v5.0 · d4051602
      one authored
      d4051602
    • one's avatar
      Enhance DTK platform support and GPU detection · 1a57f2d6
      one authored
      - Added Platform.DTK in the microbenchmark framework.
      - Introduced new DTK hipblaslt benchmark class and corresponding tests.
      - Updated Dockerfile to include hipblaslt-bench and its permissions.
      - Registered DTK benchmarks in the benchmark registry for various performance tests.
      - Enhanced GPU detection logic to recognize HYGON GPUs.
      
      This update improves the benchmarking capabilities for DTK, ensuring compatibility and performance testing across platforms.
      1a57f2d6
  12. 17 Nov, 2025 1 commit
    • Yuting Jiang's avatar
      Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB... · c65ae567
      Yuting Jiang authored
      Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)
      
      **Description**
      add --set_ib_devices option to auto-select IB device by MPI local rank 
      
      
      **Major Revision**
      - Add a new CLI flag --set_ib_devices to automatically select irregular
      IB devices based on the MPI local rank.
      - When enabled, the benchmark queries available IB devices via
      network.get_ib_devices() and selects the device corresponding to
      OMPI_COMM_WORLD_LOCAL_RANK.
      - Fall back to existing --ib_dev behavior when the flag is not provided.
      
      **Minor Revision**
      - Add an env in network.get_ib_devices() to allow user to set the device
      name
      c65ae567
  13. 23 Oct, 2025 1 commit
    • Yuting Jiang's avatar
      Benchmarks: Micro benchmark - add ncu profile support in cublaslt-gemm (#740) · f6e65a98
      Yuting Jiang authored
      **Description**
      This PR adds NCU (NVIDIA Nsight Compute) profiling support to the
      cublaslt-gemm micro benchmark, enabling detailed kernel analysis
      including DRAM throughput, compute throughput, and launch arguments.
      
      **Major Revision**
      - Add --enable_ncu_profiling and --profiling_metrics for ncu profiling
      - Modifies command execution to use NCU when profiling is enabled
      - Updates result parsing to handle both standard and NCU profiled output
      formats
      f6e65a98
  14. 22 Oct, 2025 1 commit
  15. 29 Sep, 2025 2 commits
  16. 30 Jun, 2025 1 commit
  17. 26 Jun, 2025 1 commit
  18. 25 Jun, 2025 1 commit
  19. 20 Jun, 2025 1 commit
    • WenqingLan1's avatar
      Benchmark - Add Grace CPU support for CPU Stream (#719) · 0b8d1fd4
      WenqingLan1 authored
      
      
      **Description**
      Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU
      Stream supports dual socket benchmarking.
      
      Example config for this arch support:
      ```yaml
          cpu-stream:numa0:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0
              cores: 0 1 2 3 4 5 6 7 8
          cpu-stream:numa1:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 1
              cores: 64 65 66 67 68 69 70 71 72
          cpu-stream:numa-spread:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0 1
              cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
      ```
      
      ---------
      Co-authored-by: default avatardpower4 <dilipreddi@gmail.com>
      0b8d1fd4
  20. 18 Jun, 2025 1 commit
    • WenqingLan1's avatar
      Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a
      WenqingLan1 authored
      Added GPU Stream benchmark - measures the GPU memory bandwidth and
      efficiency for double datatype through various memory operations
      including copy, scale, add, and triad.
      - added documentation for `gpu-stream` detailing its introduction,
      metrics, and descriptions.
      - added unit tests for `gpu-stream`. Example output is in
      `superbenchmark/tests/data/gpu_stream.log`.
      4eddd50a
  21. 14 Jun, 2025 1 commit
    • Hongtao Zhang's avatar
      microbenchmark - CPU Stream Benchmark Revise (#712) · 991c0051
      Hongtao Zhang authored
      
      
      In the current implementation, the CPU‑stream benchmark code renames the
      binary before the microbench base class can verify its existence,
      causing the default‐binary check to fail.
      
      This PR adds a “default” binary—built with the standard compile
      parameters—so that the base class can always find and validate it. Once
      the default binary is in place, the CPU‑stream code will rename it as
      needed and re‑check its presence before running the benchmark.
      
      The PR also enable CPU stream in the default settings.
      
      ---------
      Co-authored-by: default avatarHongtao Zhang <hongtaozhang@microsoft.com>
      991c0051
  22. 04 Mar, 2025 1 commit
  23. 15 Feb, 2025 1 commit
  24. 05 Feb, 2025 1 commit
  25. 28 Nov, 2024 1 commit
    • pdr's avatar
      Benchmarks - Add LLaMA-2 Models (#668) · 249e21c1
      pdr authored
      Added llama benchmark - training and inference in accordance with the
      existing pytorch models implementation like gpt2, lstm etc.
      
      - added llama fp8 unit test for better code coverage, to reduce memory
      required
      - updated transformers version >= 4.28.0 for LLamaConfig
      - set tokenizers version <= 0.20.3 to avoid 0.20.4 version
      [issues](https://github.com/huggingface/tokenizers/issues/1691
      
      ) with
      py3.8
      - added llama2 to tensorrt
      - llama2 tests not added to test_tensorrt_inference_performance.py due
      to large memory requirement for worker gpu. tests validated separately
      on gh200
      
      ---------
      Co-authored-by: default avatardpatlolla <dpatlolla@microsoft.com>
      249e21c1
  26. 27 Nov, 2024 1 commit
  27. 22 Nov, 2024 1 commit
  28. 20 Nov, 2024 1 commit