1. 12 Aug, 2025 1 commit
  2. 11 Aug, 2025 1 commit
    • Docs - Upgrade version and release note (#727) · adbf0357
      Hongtao Zhang authored
      Description
      
      Add release note for v0.12.0
      
      # Main Features
      ## SuperBench Improvement
      1. - [x] Update Image Build Pipeline (#659)
      2. - [x] Add support for arm64 build (#660)
      3. - [x] Upgrade dependency versions in pipeline (#671)
      4. - [x] Fix installation and lint issues (#684)
      5. - [x] Update Flake8 repo (#683)
      6. - [x] Init latest python support. (#687)
      7. - [x] Add image build on arm64 arch (#690)
      8. - [x] Enhancement of ignoring errors for import pkg_resources (#692)
      9. - [x] Update label in the ROCm image build (#693)
      10. - [x] Support cuda12.8 for Blackwell arch (#682)
      11. - [x] Merge multi-arch image (#696)
      12. - [x] Update OS of runner to the latest. (#702)
      13. - [x] cuda arch flag for cublaslt (#701)
      
      
      ## Micro-benchmark Improvement
      1. - [x] Bug Fix - Fix numa error on grace cpu in gpu-copy (#658)
      2. - [x] Dependency - Bump onnxruntime-gpu version from 1.10.0 to 1.12.0
      (#663)
      3. - [x] Benchmarks: micro benchmarks - add general CPU bandwidth and
      latency benchmark (#662)
      4. - [x] Benchmarks: micro benchmarks - add nvbandwidth build and
      benchmark (#665 and #669)
      5. - [x] Fix stderr message in gpu-copy benchmark (#673)
      6. - [x] Add arch support for 10.0 in gemm-flops (#680)
      7. - [x] Fix tensorrt-inference parsing (#674)
      8. - [x] nvbandwidth benchmark need to handle N/A value (#675)
      9. - [x] Avoid Unintended nvbandwidth Function Calls in All Benchmarks
      (#685)
      10. - [x] Add GPU Stream Micro Benchmark (#697)
      11. - [x] Cuda arch flag for cublaslt (#701)
      12. - [x] Support autotuning in cublaslt gemm (#706)
      13. - [x] Add FP4 GEMM FLOPS support for cublaslt_gemm benchmark (#711)
      14. - [x] CPU Stream Benchmark Revise (#712)
      15. - [x] Add cuda12.9 docker image (#716)
      16. - [x] Add Grace CPU support for CPU Stream (#719)
      
      
      ## Model Benchmark Improvement
      1. - [x] Add LLaMA-2 Models (#668)
      2. - [x] Fix typos in documentation and code files (#686)
      3. - [x] Add Mixture of Experts Model (#679) 
      4. - [ ] Add DeepSeek Training Benchmark
      5. - [x] Add DeepSeek Inference Benchmark (AMD GPU) (#713)
      
      
      ## Documentation
      1. - [x] Update CODEOWNERS (#670)
      2. - [x] Update CODEOWNERS (#718)
      
      ## Result Analysis
      1. - [x] Enhance logging information for diagnosis rule op baseline
      errors. (#689)
  3. 06 Aug, 2025 1 commit
  4. 31 Jul, 2025 1 commit
  5. 30 Jun, 2025 1 commit
  6. 26 Jun, 2025 1 commit
  7. 25 Jun, 2025 1 commit
  8. 24 Jun, 2025 1 commit
  9. 20 Jun, 2025 2 commits
    • Benchmark - Support autotuning in cublaslt gemm (#706) · 60b13256
      Babak Hejazi authored
      **Description**
      Enable autotuning as an opt-in mode when benchmarking cublasLt via
      `cublaslt_gemm`
      
      The implementation is based on
      https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu
      
      The behavior of the original benchmark command remains unchanged, e.g.:
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3`
      
      The new opt-in options are `-a` (enable autotuning), `-I` (autotune
      iterations, default 50, same as the default for `-i`), and `-W` (autotune
      warmups, default 20, same as the default for `-w`), e.g.:
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3
      -a`
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a
      -I 10 -W 10`
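
The autotuning idea behind `-a`, `-I`, and `-W` can be sketched in a few lines of Python. This is only an illustration of the general technique (warm up each candidate, time it, keep the fastest), not the CUDA implementation in this PR. The names `autotune`, `candidates`, and `clock` are hypothetical; the defaults mirror the `-W` and `-I` defaults quoted above.

```python
import time
from typing import Callable, Sequence


def autotune(
    candidates: Sequence[Callable[[], None]],
    warmups: int = 20,       # mirrors the -W default described above
    iterations: int = 50,    # mirrors the -I default described above
    clock: Callable[[], float] = time.perf_counter,
) -> int:
    """Return the index of the candidate with the lowest mean run time."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    best_index, best_time = -1, float("inf")
    for index, run in enumerate(candidates):
        # Warm-up runs are executed but not timed.
        for _ in range(warmups):
            run()
        start = clock()
        for _ in range(iterations):
            run()
        mean_time = (clock() - start) / iterations
        if mean_time < best_time:
            best_index, best_time = index, mean_time
    return best_index
```

In the real benchmark each candidate would correspond to one cuBLASLt algorithm and timing would happen on the GPU; here a candidate is just a Python callable, and the injectable `clock` exists only to make the sketch easy to test.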
      
      **Note:** This PR also changes the default `gemm_compute_type` for BF16
      and FP16 to `CUBLAS_COMPUTE_32F`.
      
      **Further observations:** 
      1. The support matrix of the `cublaslt_gemm` could be furt...
    • Benchmark - Add Grace CPU support for CPU Stream (#719) · 0b8d1fd4
      WenqingLan1 authored
      
      
      **Description**
      Added support for the Grace CPU neo2 architecture in CPU Stream. CPU
      Stream now supports dual-socket benchmarking.
      
      Example config for this arch support:
      ```yaml
          cpu-stream:numa0:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0
              cores: 0 1 2 3 4 5 6 7 8
          cpu-stream:numa1:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 1
              cores: 64 65 66 67 68 69 70 71 72
          cpu-stream:numa-spread:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0 1
              cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
      ```
      
      ---------
      Co-authored-by: dpower4 <dilipreddi@gmail.com>
  10. 18 Jun, 2025 1 commit
    • Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a
      WenqingLan1 authored
      Added the GPU Stream benchmark, which measures GPU memory bandwidth and
      efficiency for the double datatype through various memory operations
      including copy, scale, add, and triad.
      - Added documentation for `gpu-stream` detailing its introduction,
      metrics, and descriptions.
      - Added unit tests for `gpu-stream`. Example output is in
      `superbenchmark/tests/data/gpu_stream.log`.
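
For readers unfamiliar with the four operations, here is a small pure-Python sketch of the classic STREAM kernels and the usual way bandwidth is derived from them. It is an illustration under the standard STREAM definitions, not the CUDA code added in this PR. The names `run_kernel` and `bandwidth_gbps` are hypothetical.

```python
BYTES_PER_DOUBLE = 8

# Number of arrays read plus arrays written by each kernel
# (classic STREAM accounting).
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def run_kernel(name, a, b, c, scalar):
    """Apply one STREAM kernel in place on equal-length lists a, b, c."""
    n = len(a)
    if name == "copy":
        for i in range(n):
            c[i] = a[i]
    elif name == "scale":
        for i in range(n):
            b[i] = scalar * c[i]
    elif name == "add":
        for i in range(n):
            c[i] = a[i] + b[i]
    elif name == "triad":
        for i in range(n):
            a[i] = b[i] + scalar * c[i]
    else:
        raise ValueError(f"unknown kernel: {name}")


def bandwidth_gbps(name, n, seconds):
    """Bandwidth in GB/s for a kernel over n doubles that took `seconds`."""
    bytes_moved = ARRAYS_TOUCHED[name] * BYTES_PER_DOUBLE * n
    return bytes_moved / seconds / 1e9
```

A triad over one million doubles moves three arrays of 8 MB each, so a 1 ms run corresponds to roughly 24 GB/s.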
  11. 14 Jun, 2025 1 commit
    • microbenchmark - CPU Stream Benchmark Revise (#712) · 991c0051
      Hongtao Zhang authored
      
      
      In the current implementation, the CPU‑stream benchmark code renames the
      binary before the microbench base class can verify its existence,
      causing the default‐binary check to fail.
      
      This PR adds a “default” binary, built with the standard compile
      parameters, so that the base class can always find and validate it. Once
      the default binary is in place, the CPU‑stream code will rename it as
      needed and re‑check its presence before running the benchmark.
      
      This PR also enables CPU Stream in the default settings.
      
      ---------
      Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
  12. 05 Jun, 2025 1 commit
  13. 01 May, 2025 1 commit
  14. 30 Apr, 2025 1 commit
  15. 09 Apr, 2025 1 commit
  16. 21 Mar, 2025 1 commit
  17. 12 Mar, 2025 1 commit
    • CI/CD - Update label in the ROCm image build (#693) · 48cd8a3c
      Hongtao Zhang authored
      
      
      GitHub Actions runs the individual configurations of a matrix job in
      parallel, and the matrix strategy's "fail-fast" setting is enabled by
      default: if one matrix job (for example, the one labeled
      "rocm6_2_rocm6_2_x_superbe") fails, the remaining parallel jobs are
      canceled automatically.
      
      In our current image build pipeline, the arm64 build job is always
      canceled because of the ROCm build job. As a temporary solution, this PR
      uses a non-existent label in the job config to prevent the ROCm build
      job from being scheduled.
      
      ---------
      Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
  18. 08 Mar, 2025 1 commit
  19. 07 Mar, 2025 1 commit
  20. 04 Mar, 2025 1 commit
  21. 25 Feb, 2025 2 commits
  22. 15 Feb, 2025 1 commit
  23. 05 Feb, 2025 2 commits
    • Bugfix - nvbandwidth benchmark need to handle N/A value (#675) · 45d06647
      Hongtao Zhang authored
      
      
      **Description**
      
      1. Fixed a bug where the nvbandwidth benchmark did not handle 'N/A'
      values in the nvbandwidth command output.
      2. Replaced the input format of test cases with a list.
      3. Added an nvbandwidth configuration example to the default config files.
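
One way to tolerate 'N/A' cells when parsing a result matrix is sketched below. It assumes each matrix row is a row index followed by whitespace-separated cells, which is an assumption about the output layout rather than something stated in this PR. The function names, the metric prefix, and the sample values in the usage note are all hypothetical.

```python
import math


def parse_matrix_row(line):
    """Parse '<row> <cell> <cell> ...' into (row, values); 'N/A' becomes NaN."""
    tokens = line.split()
    row = int(tokens[0])
    values = [math.nan if token == "N/A" else float(token) for token in tokens[1:]]
    return row, values


def to_metrics(prefix, row, values):
    """Build metric entries for one row, skipping cells that were 'N/A'."""
    return {
        f"{prefix}_{row}_{col}": value
        for col, value in enumerate(values)
        if not math.isnan(value)
    }
```

For a row such as `0  N/A  276.07`, the first cell is dropped and only the second produces a metric. Skipping the cell instead of raising keeps one unavailable measurement from failing the whole parse.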
      
      ---------
      Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
      Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
    • Bug - Fix tensorrt-inference parsing (#674) · 7af7c0b7
      Kirill Prosvirov authored
      **Description**
      Today I was running a benchmark on my machine and encountered a strange
      issue with tensorrt-inference.
      I got return code 33, which according to the source code is:
      ```
      MICROBENCHMARK_RESULT_PARSING_FAILURE = 33
      ```
      I dug into the code and found the following problem. The parser failed
      on the following line:
      ```
      [11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms
      ```
      I ran the regular expression separately against this line and found
      that it did not handle cases like this, where a value in milliseconds
      is an integer (here, `8 ms`).
      That is why this pull request was created.
      I came up with the closest possible regular expression that fixes this
      issue without introducing any other bug.
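
A minimal sketch of a pattern that accepts both integer and decimal millisecond values, tested against the log line quoted above. This is not the regular expression merged in this PR. The names `NUMBER`, `LATENCY_FIELD`, and `parse_latency` are hypothetical.

```python
import re

# Digits with an optional fractional part, so "8" and "7.2793" both match.
NUMBER = r"\d+(?:\.\d+)?"

# A field looks like "min = 7.2793 ms" or "percentile(99%) = 8 ms".
LATENCY_FIELD = re.compile(rf"(\w+(?:\(\d+%\))?) = ({NUMBER}) ms")


def parse_latency(line):
    """Return {field name: milliseconds} for a 'Latency:' log line, else {}."""
    if "Latency:" not in line:
        return {}
    _, _, fields = line.partition("Latency:")
    return {name: float(value) for name, value in LATENCY_FIELD.findall(fields)}
```

The optional group `(?:\.\d+)?` is the piece that lets a bare integer such as `8` match where a pattern requiring a decimal point would not.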
      
      **Major Revision**
      - 0.11.0
  24. 04 Feb, 2025 3 commits
  25. 08 Jan, 2025 1 commit
  26. 28 Nov, 2024 2 commits
  27. 27 Nov, 2024 1 commit
  28. 22 Nov, 2024 1 commit
  29. 21 Nov, 2024 2 commits
  30. 20 Nov, 2024 1 commit
  31. 15 Nov, 2024 1 commit
  32. 07 Nov, 2024 2 commits