1. 17 Mar, 2026 1 commit
  2. 11 Mar, 2026 1 commit
  3. 04 Feb, 2026 1 commit
  4. 28 Jan, 2026 1 commit
  5. 21 Dec, 2025 1 commit
• CI/CD - Fix Azure pipeline (#767) · c99380b4
      Hongtao Zhang authored
      
      
      **Description**
The Azure pipeline cpu-unit-test failed with "2025-12-10T03:47:59.0628597Z
ERROR: Could not install packages due to an OSError: [Errno 28] No space
left on device".
      
      **Root Cause**
      This happens because the matrix jobs (Python 3.7, 3.10, 3.12) run in
      parallel and share the same VM's disk. Python 3.12 downloads
      newer/larger packages (especially PyTorch and NVIDIA CUDA libraries
      which are ~3GB+), and when multiple jobs run simultaneously, they
      exhaust the disk space.
      
      **Fix**
Disable the pip cache when installing SuperBench.
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
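One common way to implement this fix is to disable pip's download cache at install time. A sketch of the likely change in Azure Pipelines syntax (the actual step name and install command in the pipeline may differ):

```yaml
- script: python -m pip install --no-cache-dir .
  displayName: Install SuperBench without the pip cache
```

With `--no-cache-dir`, pip skips writing downloaded wheels (including the multi-GB PyTorch/CUDA packages) to the shared cache directory, so parallel matrix jobs no longer compete for that disk space.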
  6. 04 Dec, 2025 1 commit
  7. 17 Nov, 2025 1 commit
• Benchmarks: micro benchmarks - add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733) · c65ae567
  Yuting Jiang authored
      
      **Description**
Add a --set_ib_devices option to auto-select the IB device by MPI local rank.
      
      
      **Major Revision**
      - Add a new CLI flag --set_ib_devices to automatically select irregular
      IB devices based on the MPI local rank.
      - When enabled, the benchmark queries available IB devices via
      network.get_ib_devices() and selects the device corresponding to
      OMPI_COMM_WORLD_LOCAL_RANK.
      - Fall back to existing --ib_dev behavior when the flag is not provided.
      
      **Minor Revision**
- Add an environment variable in network.get_ib_devices() to allow the
user to set the device name
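The rank-to-device mapping described above can be sketched as follows (a minimal illustration, not the actual SuperBench code; `select_ib_device` is a hypothetical helper and `ib_devices` stands in for the list returned by `network.get_ib_devices()`):

```python
import os

def select_ib_device(ib_devices):
    """Pick the IB device matching this process's MPI local rank.

    OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI for each process;
    it defaults to 0 here when the benchmark runs outside MPI.
    """
    local_rank = int(os.environ.get('OMPI_COMM_WORLD_LOCAL_RANK', 0))
    # One device per local rank; wrap around if ranks outnumber devices.
    return ib_devices[local_rank % len(ib_devices)]
```

When the flag is not provided, the benchmark would keep using the explicit `--ib_dev` value instead of this automatic selection.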
  8. 06 Nov, 2025 1 commit
  9. 05 Nov, 2025 1 commit
  10. 23 Oct, 2025 1 commit
• Benchmarks: Micro benchmark - add ncu profile support in cublaslt-gemm (#740) · f6e65a98
      Yuting Jiang authored
      **Description**
      This PR adds NCU (NVIDIA Nsight Compute) profiling support to the
      cublaslt-gemm micro benchmark, enabling detailed kernel analysis
      including DRAM throughput, compute throughput, and launch arguments.
      
      **Major Revision**
- Add --enable_ncu_profiling and --profiling_metrics flags for NCU
profiling
- Modify command execution to use NCU when profiling is enabled
- Update result parsing to handle both standard and NCU-profiled output
formats
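The command-wrapping step can be sketched like this (illustrative only; the flag names mirror common `ncu` CLI usage, not necessarily the exact PR implementation, and `wrap_with_ncu` is a hypothetical helper):

```python
def wrap_with_ncu(command, metrics):
    """Prefix a benchmark command with an NCU invocation.

    `ncu --csv --metrics <list> <app>` profiles the app's kernels and
    emits the requested metrics in CSV form for later parsing.
    """
    ncu_prefix = ['ncu', '--csv', '--metrics', ','.join(metrics)]
    return ncu_prefix + command

cmd = wrap_with_ncu(
    ['cublaslt_gemm', '-m', '2048', '-n', '2048', '-k', '2048'],
    ['dram__throughput.avg.pct_of_peak_sustained_elapsed'],
)
```

The result parser then has to branch: NCU-profiled runs produce CSV metric rows in addition to the benchmark's own standard output.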
  11. 22 Oct, 2025 2 commits
  12. 08 Oct, 2025 2 commits
  13. 01 Oct, 2025 1 commit
  14. 30 Sep, 2025 1 commit
• Benchmarks: Micro benchmark - Add simultaneous all-to-host / host-to-all bandwidth testcases to nvbandwidth (#736) · 93e9d262
  Yuting Jiang authored
      
      **Description**
Add simultaneous all-to-host / host-to-all bandwidth testcases to
nvbandwidth.

**Major Revision**
- nvbandwidth.patch: Add simultaneous all-to-host / host-to-all
bandwidth testcases to nvbandwidth
- Upgrade the nvbandwidth submodule to v0.8
- Apply the patch in the Makefile build
  15. 29 Sep, 2025 2 commits
  16. 19 Sep, 2025 1 commit
  17. 12 Aug, 2025 1 commit
  18. 30 Jun, 2025 1 commit
  19. 26 Jun, 2025 1 commit
  20. 25 Jun, 2025 1 commit
  21. 24 Jun, 2025 1 commit
  22. 20 Jun, 2025 2 commits
• Benchmark - Support autotuning in cublaslt gemm (#706) · 60b13256
      Babak Hejazi authored
      **Description**
      Enable autotuning as an opt-in mode when benchmarking cublasLt via
      `cublaslt_gemm`
      
      The implementation is based on
      https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu
      
The behavior of the original benchmark command remains unchanged, e.g.:
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3`
      
      The new opt-in options are `-a` (for autotune) and `-I` (for autotune
      iterations, default is 50, same as the default for `-i`) and `-W` (for
      autotune warmups, default=20, same as the default for `-w`), e.g.:
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3
      -a`
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a
      -I 10 -W 10`
      
      **Note:** This PR also changes the default `gemm_compute_type` for BF16
      and FP16 to `CUBLAS_COMPUTE_32F`.
      
      **Further observations:** 
      1. The support matrix of the `cublaslt_gemm` could be furt...
• Benchmark - Add Grace CPU support for CPU Stream (#719) · 0b8d1fd4
      WenqingLan1 authored
      
      
      **Description**
      Added support for Grace CPU neo2 architecture in CPU Stream. Now CPU
      Stream supports dual socket benchmarking.
      
      Example config for this arch support:
      ```yaml
          cpu-stream:numa0:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0
              cores: 0 1 2 3 4 5 6 7 8
          cpu-stream:numa1:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 1
              cores: 64 65 66 67 68 69 70 71 72
          cpu-stream:numa-spread:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0 1
              cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
      ```
      
      ---------
Co-authored-by: dpower4 <dilipreddi@gmail.com>
  23. 18 Jun, 2025 1 commit
• Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a
      WenqingLan1 authored
Added GPU Stream benchmark - measures GPU memory bandwidth and
efficiency for the double datatype through various memory operations
including copy, scale, add, and triad.
      - added documentation for `gpu-stream` detailing its introduction,
      metrics, and descriptions.
      - added unit tests for `gpu-stream`. Example output is in
      `superbenchmark/tests/data/gpu_stream.log`.
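The four STREAM-style operations named above can be sketched in plain Python (an illustrative model of the kernels, not the CUDA implementation behind `gpu-stream`):

```python
def stream_kernels(a, b, c, scalar):
    """Reference versions of the four STREAM operations.

    The real benchmark runs these as GPU kernels over large
    double-precision arrays and reports the sustained memory
    bandwidth achieved by each.
    """
    c = [x for x in a]                          # copy:  c = a
    b = [scalar * x for x in c]                 # scale: b = scalar * c
    c = [x + y for x, y in zip(a, b)]           # add:   c = a + b
    a = [y + scalar * z for y, z in zip(b, c)]  # triad: a = b + scalar * c
    return a, b, c
```

Each operation touches a known number of bytes per element, so dividing bytes moved by elapsed time gives the bandwidth metric the benchmark reports.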
  24. 14 Jun, 2025 1 commit
• microbenchmark - CPU Stream Benchmark Revise (#712) · 991c0051
      Hongtao Zhang authored
      
      
      In the current implementation, the CPU‑stream benchmark code renames the
      binary before the microbench base class can verify its existence,
      causing the default‐binary check to fail.
      
      This PR adds a “default” binary—built with the standard compile
      parameters—so that the base class can always find and validate it. Once
      the default binary is in place, the CPU‑stream code will rename it as
      needed and re‑check its presence before running the benchmark.
      
The PR also enables CPU Stream in the default settings.
      
      ---------
Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
  25. 05 Jun, 2025 1 commit
  26. 01 May, 2025 1 commit
  27. 30 Apr, 2025 1 commit
  28. 09 Apr, 2025 1 commit
  29. 21 Mar, 2025 1 commit
  30. 12 Mar, 2025 1 commit
• CI/CD - Update label in the ROCm image build (#693) · 48cd8a3c
      Hongtao Zhang authored
      
      
This is due to the matrix strategy's default "fail-fast" setting. In
GitHub Actions, when running a job with a matrix, the individual
configurations run in parallel. By default, if one matrix job (for
example, the one labeled "rocm6_2_rocm6_2_x_superbe") fails, the
remaining parallel jobs are canceled automatically.

In our current build-image pipeline, the arm64 build job is always
canceled by the failing ROCm build job. So, as a temporary solution, use
a non-existent label in the job config to prevent the ROCm build job
from scheduling.
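For reference, the cancellation behavior can also be switched off directly in the workflow with standard GitHub Actions syntax (a sketch only; this is not what the PR does, which instead parks the ROCm job on a non-existent runner label, and the job and matrix names here are hypothetical):

```yaml
jobs:
  build-image:
    strategy:
      fail-fast: false   # keep other matrix jobs running if one fails
      matrix:
        target: [rocm, arm64]
```

With `fail-fast: false`, a failing ROCm configuration would no longer cancel the in-flight arm64 build.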
      
      ---------
Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
  31. 08 Mar, 2025 1 commit
  32. 07 Mar, 2025 1 commit
  33. 04 Mar, 2025 1 commit
  34. 25 Feb, 2025 2 commits
  35. 15 Feb, 2025 1 commit