1. 24 Jun, 2025 1 commit
  2. 20 Jun, 2025 2 commits
    • Benchmark - Support autotuning in cublaslt gemm (#706) · 60b13256
      Babak Hejazi authored
      **Description**
      Enable autotuning as an opt-in mode when benchmarking cublasLt via
      `cublaslt_gemm`
      
      The implementation is based on
      https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu
      
      The behavior of the original benchmark command remains unchanged, e.g.:
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3`
      
      The new opt-in options are `-a` (enable autotune), `-I` (autotune
      iterations; default 50, same as the default for `-i`), and `-W`
      (autotune warmups; default 20, same as the default for `-w`), e.g.:
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3
      -a`
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a
      -I 10 -W 10`
      
      **Note:** This PR also changes the default `gemm_compute_type` for BF16
      and FP16 to `CUBLAS_COMPUTE_32F`.
      
      **Further observations:** 
      1. The support matrix of `cublaslt_gemm` could be extended in the
      future to support non-FP16 output for FP8 inputs as well.
      2. Currently, the input matrices are initialized with values of 1.0 and
      2.0, which makes them less demanding in terms of power. Another future
      extension could be a fill mode using, say, uniform random numbers
      between -1 and 1.
      3. cuBLAS workspace recommendations are listed under
      https://docs.nvidia.com/cuda/cublas/#cublassetworkspace
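
      The alternate fill mode suggested in observation 2 could look like the
      hypothetical helper below (`fill_matrix` is not part of
      `cublaslt_gemm`; it only illustrates the idea):

```python
import random

def fill_matrix(rows, cols, mode="constant", value=1.0, seed=None):
    """Return a rows x cols matrix filled per the chosen mode.

    'constant' mirrors the current 1.0/2.0 initialization; 'uniform'
    draws values from [-1, 1], which exercises the multipliers harder
    and should draw more power.
    """
    rng = random.Random(seed)
    if mode == "constant":
        return [[value] * cols for _ in range(rows)]
    if mode == "uniform":
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
                for _ in range(rows)]
    raise ValueError(f"unknown fill mode: {mode}")
```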
      
      
      
      Update (June 10, 2025): verified using a higher-level test driver with
      these commands:
      
      1. inline:
      ```
      python3 -c "                                                                            
      from superbench.benchmarks import BenchmarkRegistry, Platform
      from superbench.common.utils import logger
      
      parameters = (
          '--num_warmup 10 --num_steps 50 '
          '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 '
          '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50'
      )
      context = BenchmarkRegistry.create_benchmark_context(
          'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters
      )
      benchmark = BenchmarkRegistry.launch_benchmark(context)
      logger.info('Result: {}'.format(benchmark.result))
      "
      ```
      
      2. newly added script: 
      `python3 examples/benchmarks/cublaslt_function.py`
      
      ---------
      Co-authored-by: Babak Hejazi <babakh@nvidia.com>
    • Benchmark - Add Grace CPU support for CPU Stream (#719) · 0b8d1fd4
      WenqingLan1 authored
      
      
      **Description**
      Added support for the Grace CPU neo2 architecture in CPU Stream. CPU
      Stream now supports dual-socket benchmarking.
      
      Example config for this arch support:
      ```yaml
          cpu-stream:numa0:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0
              cores: 0 1 2 3 4 5 6 7 8
          cpu-stream:numa1:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 1
              cores: 64 65 66 67 68 69 70 71 72
          cpu-stream:numa-spread:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0 1
              cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
      ```
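
      The `cores` lists above follow a simple per-socket pattern; the
      hypothetical helper below generates them (the start offset of 64 for
      the second socket matches this example but is system-specific):

```python
def cores_for_numa(node, socket_offset=64, count=9):
    """Space-separated core list for one NUMA node, assuming node n's
    cores start at n * socket_offset (illustrative layout)."""
    start = node * socket_offset
    return " ".join(str(c) for c in range(start, start + count))

def cores_spread(nodes, socket_offset=64, count=9):
    """Concatenated core list for the numa-spread configuration."""
    return " ".join(cores_for_numa(n, socket_offset, count) for n in nodes)
```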
      
      ---------
      Co-authored-by: dpower4 <dilipreddi@gmail.com>
  3. 18 Jun, 2025 1 commit
    • Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a
      WenqingLan1 authored
      Added the GPU Stream benchmark, which measures GPU memory bandwidth and
      efficiency for the double datatype through memory operations including
      copy, scale, add, and triad.
      - Added documentation for `gpu-stream` detailing its introduction,
      metrics, and metric descriptions.
      - Added unit tests for `gpu-stream`. Example output is in
      `superbenchmark/tests/data/gpu_stream.log`.
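
      Copy, scale, add, and triad are the classic STREAM kernels; a minimal
      Python sketch of what each computes and how bandwidth is derived
      (function names are illustrative, not the benchmark's API):

```python
def stream_kernels(a, b, c, scalar=3.0):
    """Reference versions of the four STREAM operations on Python lists.
    Per element, copy and scale move 2 doubles (16 B); add and triad
    move 3 doubles (24 B)."""
    n = len(a)
    copy = [a[i] for i in range(n)]                   # c[i] = a[i]
    scale = [scalar * c[i] for i in range(n)]         # b[i] = s * c[i]
    add = [a[i] + b[i] for i in range(n)]             # c[i] = a[i] + b[i]
    triad = [b[i] + scalar * c[i] for i in range(n)]  # a[i] = b[i] + s*c[i]
    return copy, scale, add, triad

def bandwidth_gbps(bytes_per_elem, n, seconds):
    """STREAM-style bandwidth: total bytes moved divided by elapsed time."""
    return bytes_per_elem * n / seconds / 1e9
```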
  4. 14 Jun, 2025 1 commit
    • microbenchmark - CPU Stream Benchmark Revise (#712) · 991c0051
      Hongtao Zhang authored
      
      
      In the current implementation, the CPU Stream benchmark code renames
      the binary before the microbench base class can verify its existence,
      causing the default-binary check to fail.
      
      This PR adds a "default" binary, built with the standard compile
      parameters, so that the base class can always find and validate it.
      Once the default binary is in place, the CPU Stream code renames it as
      needed and re-checks its presence before running the benchmark.
      
      The PR also enables CPU Stream in the default settings.
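
      The check-rename-recheck flow described above could be sketched like
      this (paths and names are illustrative, not the actual superbench
      code):

```python
import os
import shutil

def prepare_binary(bin_dir, default_name, target_name):
    """Validate that the default binary exists, rename it for the
    current configuration, then re-check before running."""
    default_path = os.path.join(bin_dir, default_name)
    if not os.path.isfile(default_path):
        # Mirrors the microbench base class existence check.
        raise FileNotFoundError(default_path)
    target_path = os.path.join(bin_dir, target_name)
    shutil.move(default_path, target_path)  # config-specific rename
    if not os.path.isfile(target_path):
        # Re-check presence before launching the benchmark.
        raise FileNotFoundError(target_path)
    return target_path
```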
      
      ---------
      Co-authored-by: Hongtao Zhang <hongtaozhang@microsoft.com>
  5. 01 May, 2025 1 commit
  6. 21 Mar, 2025 1 commit
  7. 04 Mar, 2025 1 commit
  8. 25 Feb, 2025 1 commit
  9. 15 Feb, 2025 1 commit
  10. 05 Feb, 2025 2 commits
    • Bugfix - nvbandwidth benchmark needs to handle N/A value (#675) · 45d06647
      Hongtao Zhang authored
      
      
      **Description**
      
      1. Fixed a bug where the nvbandwidth benchmark did not handle 'N/A'
      values in the nvbandwidth command output.
      2. Replaced the input format of test cases with a list.
      3. Added an nvbandwidth configuration example to the default config
      files.
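
      Handling 'N/A' amounts to treating the cell as missing rather than
      failing the float conversion; a minimal sketch (hypothetical helper
      names, not the actual parser):

```python
def parse_bandwidth_cell(cell):
    """Convert one nvbandwidth matrix cell to float, or None for 'N/A'."""
    cell = cell.strip()
    if cell.upper() == "N/A":
        return None
    return float(cell)

def parse_row(cells):
    """Parse a row of cells, keeping None for unavailable links."""
    return [parse_bandwidth_cell(c) for c in cells]
```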
      
      ---------
      Co-authored-by: hongtaozhang <hongtaozhang@microsoft.com>
      Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>
    • Bug - Fix tensorrt-inference parsing (#674) · 7af7c0b7
      Kirill Prosvirov authored
      **Description**
      While running a benchmark on my machine, I encountered an issue with
      tensorrt-inference: it exited with code 33, which according to the
      source code is:
      ```
      MICROBENCHMARK_RESULT_PARSING_FAILURE = 33
      ```
      Digging into the code, I found the problem: the parser stumbled on the
      following line:
      ```
      [11/28/2024-17:03:11] [I] Latency: min = 7.2793 ms, max = 10.1606 ms, mean = 7.41642 ms, median = 7.39551 ms, percentile(99%) = 8 ms
      ```
      Testing the parser separately showed that the regular expression did
      not handle cases like this, where a latency value in milliseconds is an
      integer (`8 ms`) rather than a float.
      This pull request changes the regular expression as minimally as
      possible to fix the issue without introducing other bugs.
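
      A pattern along these lines fixes it: making the fractional part
      optional lets integer millisecond values match as well (a sketch, not
      necessarily the exact expression in the PR):

```python
import re

# Optional fractional part: matches both '7.2793 ms' and '8 ms'.
LATENCY_RE = re.compile(r'(\w+(?:\(99%\))?)\s*=\s*(\d+(?:\.\d+)?)\s*ms')

def parse_latency(line):
    """Return {stat_name: value_ms} for a trtexec latency summary line."""
    return {m.group(1): float(m.group(2)) for m in LATENCY_RE.finditer(line)}
```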
      
      **Major Revision**
      - 0.11.0
  11. 04 Feb, 2025 1 commit
  12. 28 Nov, 2024 2 commits
  13. 27 Nov, 2024 1 commit
  14. 22 Nov, 2024 1 commit
  15. 20 Nov, 2024 1 commit
  16. 06 Nov, 2024 1 commit
    • Dockerfile - Add support for arm64 build (#660) · 47949127
      pdr authored
      Add support for arm64 build:
      
      - Updated Dockerfile for arm64 build
      - Extended CPU Stream compilation for Neoverse
      - Handled onnxruntime-gpu installation
      - Filtered third-party builds based on arch
      - Disabled CUDA decode perf build for non-x86
  17. 05 Nov, 2024 1 commit
    • Bug Fix - Fix numa error on grace cpu in gpu-copy (#658) · 59d36f7f
      pdr authored
      The current GPU Copy BW benchmark fails on NVIDIA Grace systems. The
      cause is memory-only NUMA nodes: `numa_run_on_node` fails for such
      nodes and the benchmark halts completely.
      
      This fix checks whether each NUMA node has CPU cores assigned; if it
      has none, the node is skipped during argument creation and the
      benchmark continues.
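
      The skip logic amounts to filtering out NUMA nodes with no CPUs before
      building the benchmark arguments; a hypothetical sketch, where the
      node-to-CPU mapping would come from libnuma or sysfs on a real system:

```python
def runnable_numa_nodes(node_cpus):
    """Given {node_id: [cpu_ids]}, return the nodes usable with
    numa_run_on_node, skipping memory-only nodes (empty CPU lists)."""
    return [node for node, cpus in sorted(node_cpus.items()) if cpus]
```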
  18. 10 Oct, 2024 1 commit
  19. 20 Aug, 2024 1 commit
  20. 16 Aug, 2024 1 commit
  21. 13 Aug, 2024 1 commit
  22. 26 Jul, 2024 1 commit
  23. 23 Jul, 2024 1 commit
  24. 02 Apr, 2024 1 commit
  25. 08 Jan, 2024 1 commit
    • Release - SuperBench v0.10.0 (#607) · 2c88db90
      Yifan Xiong authored
      **Description**
      
      Cherry-pick bug fixes from v0.10.0 to main.
      
      **Major Revisions**
      
      * Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
      * Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
      * Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
      * Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
      * Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
      * CI/CD - Add ndv5 topo file #597
      * Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
      * Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
      * Dockerfile - Bug fix for rocm docker build and deploy #598
      * Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
      * Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
      * Monitor - U...
  26. 11 Dec, 2023 1 commit
  27. 10 Dec, 2023 1 commit
  28. 09 Dec, 2023 1 commit
  29. 08 Dec, 2023 1 commit
  30. 07 Dec, 2023 1 commit
  31. 05 Dec, 2023 1 commit
  32. 04 Dec, 2023 1 commit
  33. 27 Nov, 2023 1 commit
    • Monitor - Add support for AMD GPU. (#580) · 028819b3
      guoshzhao authored
      **Description**
      Add AMD GPU support in the monitor.
      
      **Major Revision**
      - Added the pyrsmi library to collect metrics.
      - Currently collects device_utilization, device_power,
      device_used_memory, and device_total_memory.
  34. 22 Nov, 2023 4 commits