1. 18 Apr, 2026 1 commit
    • Benchmark: Model benchmark - deterministic training support (#731) (#2) · 47d4a79d
      one authored
      
      
      Adds an opt-in deterministic training mode to SuperBench's PyTorch model
      benchmarks. When enabled with --enable-determinism, PyTorch deterministic
      algorithms are enforced, and per-step numerical fingerprints (loss,
      activation means) are recorded as metrics. These can be compared across
      runs using the existing sb result diagnosis pipeline to verify bit-exact
      reproducibility, which is useful for hardware validation and platform
      comparison.
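The idea of seeded, periodically recorded fingerprints can be illustrated with a minimal sketch. This is not SuperBench's or PyTorch's code: the toy loop, the function name `run_training`, and the rule "record when `step % check_frequency == 0`" are all assumptions made for illustration.

```python
import random


def run_training(seed, num_steps, check_frequency):
    """Toy stand-in for a training loop; returns {step: loss} fingerprints."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    weight = 0.0
    fingerprints = {}
    for step in range(1, num_steps + 1):
        sample = rng.uniform(-1.0, 1.0)
        loss = (weight - sample) ** 2
        weight -= 0.1 * 2.0 * (weight - sample)  # one gradient-descent step
        # Assumed recording rule: keep a fingerprint every check_frequency steps.
        if step % check_frequency == 0:
            fingerprints[step] = loss
    return fingerprints


run_1 = run_training(seed=42, num_steps=100, check_frequency=10)
run_2 = run_training(seed=42, num_steps=100, check_frequency=10)

# Exact equality (no tolerance) is what "bit-exact" means here.
print(run_1 == run_2)
print(sorted(run_1))
```

With the same seed and the same sequence of floating-point operations, the two runs produce identical fingerprints; any difference in seed or arithmetic order would show up as a mismatch at the first recorded step it affects.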
       
      Flags added - 
      
      --enable-determinism
      --check-frequency: number of steps between recorded metrics
      --deterministic-seed
      
      Changes - 
      
      Updated pytorch_base.py to handle deterministic settings, logging.
      Added a new example script: pytorch_deterministic_example.py
      Added a test file: test_pytorch_determinism_all.py to verify everything
      works as expected.
      
      Usage - 
      
      Step 1: Run 1 - Run with --enable-determinism; the necessary metrics
      are recorded in the results-summary.jsonl file.
      Step 2: Generate the baseline file from the Run 1 results using
      sb result generate-baseline.
      Step 3: Run 2 - Run with --enable-determinism on a different machine
      (or the same machine); the necessary metrics are recorded in that
      run's results-summary.jsonl file.
      Step 4: Run diagnosis on the results from the two runs using the
      sb result diagnosis command.
      
      Note - 
      1. Keep all parameters identical between the two runs
      2. Running the diagnosis command requires the rules.yaml file
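For a quick manual check outside the diagnosis pipeline, the comparison of two result files can be sketched as below. This is only an illustration, not the `sb result diagnosis` implementation: the metric names are hypothetical, and the sketch assumes each JSON Lines file holds a flat mapping of metric names to values for a single node.

```python
import json


def parse_jsonl(text):
    """Merge every JSON object in JSON Lines text into one flat dict."""
    metrics = {}
    for line in text.splitlines():
        if line.strip():
            metrics.update(json.loads(line))
    return metrics


def diff_metrics(baseline, candidate, prefix=""):
    """Return sorted metric names under `prefix` that differ or are missing."""
    names = {n for n in set(baseline) | set(candidate) if n.startswith(prefix)}
    # Exact comparison on purpose: determinism checks allow no tolerance.
    return sorted(n for n in names if baseline.get(n) != candidate.get(n))


# Hypothetical metric names and values, for illustration only.
run_1 = parse_jsonl('{"demo/loss_step_10": 2.5, "demo/loss_step_20": 1.25}\n')
run_2 = parse_jsonl('{"demo/loss_step_10": 2.5, "demo/loss_step_20": 1.75}\n')

print(diff_metrics(run_1, run_2, prefix="demo/"))  # ['demo/loss_step_20']
```

An empty list means every compared metric matched exactly; a metric present in only one run is reported as a difference.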
      
      ---------
      Co-authored-by: Aishwarya Tonpe <aishwarya.tonpe25@gmail.com>
      Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
  2. 19 Mar, 2026 1 commit
  3. 20 Jun, 2025 2 commits
    • Benchmark - Support autotuning in cublaslt gemm (#706) · 60b13256
      Babak Hejazi authored
      **Description**
      Enable autotuning as an opt-in mode when benchmarking cublasLt via
      `cublaslt_gemm`
      
      The implementation is based on
      https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuBLASLt/LtSgemmSimpleAutoTuning/sample_cublasLt_LtSgemmSimpleAutoTuning.cu
      
      The behavior of the original benchmark command remains unchanged, e.g.:
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w10000 -i 1000 -t fp8e4m3`
      
      The new opt-in options are `-a` (enable autotuning), `-I` (autotune
      iterations, default 50, same as the default for `-i`), and `-W`
      (autotune warmups, default 20, same as the default for `-w`), e.g.:
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a`
      - `cublaslt_gemm -m 2048 -n 12288 -k 1536 -w 10000 -i 1000 -t fp8e4m3 -a -I 10 -W 10`
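The general shape of such an autotuning pass (warm up each candidate, time a fixed number of iterations, keep the fastest) can be sketched in a few lines. This is a host-side illustration under stated assumptions, not the benchmark's CUDA code: the `autotune` function and the candidate callables are hypothetical stand-ins for cuBLASLt algorithm candidates, and only the defaults of 20 warmups and 50 iterations are taken from the description above.

```python
import time


def autotune(candidates, warmups=20, iterations=50, timer=time.perf_counter):
    """Time each candidate callable; return (best_name, {name: avg_seconds})."""
    timings = {}
    for name, kernel in candidates.items():
        for _ in range(warmups):  # untimed runs to reach a steady state
            kernel()
        start = timer()
        for _ in range(iterations):
            kernel()
        timings[name] = (timer() - start) / iterations
    best = min(timings, key=timings.get)  # lowest average time wins
    return best, timings


# Hypothetical candidates: two ways of summing the same numbers.
data = list(range(10_000))
candidates = {
    "builtin_sum": lambda: sum(data),
    "manual_loop": lambda: [x for x in data][-1],
}
best, timings = autotune(candidates, warmups=5, iterations=10)
print(best, timings[best])
```

The selected candidate would then be used for the main timed run, which keeps its own warmup and iteration counts.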
      
      **Note:** This PR also changes the default `gemm_compute_type` for BF16
      and FP16 to `CUBLAS_COMPUTE_32F`.
      
      **Further observations:** 
      1. The support matrix of `cublaslt_gemm` could be extended in the
      future to support non-FP16 output for FP8 inputs as well.
      2. Currently, the input matrices are initialized with values of 1.0 and
      2.0, which makes them less demanding in terms of power. Another future
      extension could be to add a fill mode for, say, uniform random
      numbers between -1 and 1.
      3. cuBLAS workspace recommendations are listed under
      https://docs.nvidia.com/cuda/cublas/#cublassetworkspace
      
      
      
      Update (June 10, 2025): verified using higher level test driver with
      these commands:
      
      1. inline:
      ```
      python3 -c "
      from superbench.benchmarks import BenchmarkRegistry, Platform
      from superbench.common.utils import logger
      
      parameters = (
          '--num_warmup 10 --num_steps 50 '
          '--shapes 512,512,512 1024,1024,1024 --in_types fp16 fp32 '
          '--enable_autotune --num_warmup_autotune 20 --num_steps_autotune 50'
      )
      context = BenchmarkRegistry.create_benchmark_context(
          'cublaslt-gemm', platform=Platform.CUDA, parameters=parameters
      )
      benchmark = BenchmarkRegistry.launch_benchmark(context)
      logger.info('Result: {}'.format(benchmark.result))
      "
      ```
      
      2. newly added script: 
      `python3 examples/benchmarks/cublaslt_function.py`
      
      ---------
      Co-authored-by: Babak Hejazi <babakh@nvidia.com>
    • Benchmark - Add Grace CPU support for CPU Stream (#719) · 0b8d1fd4
      WenqingLan1 authored
      
      
      **Description**
      Added support for the Grace CPU neo2 architecture in CPU Stream. CPU
      Stream now supports dual-socket benchmarking.
      
      Example config for this arch support:
      ```yaml
          cpu-stream:numa0:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0
              cores: 0 1 2 3 4 5 6 7 8
          cpu-stream:numa1:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 1
              cores: 64 65 66 67 68 69 70 71 72
          cpu-stream:numa-spread:
            timeout: *default_timeout
            modes:
            - name: local
              parallel: no
            parameters:
              cpu_arch: neo2
              numa_mem_nodes: 0 1
              cores: 0 1 2 3 4 5 6 7 8 64 65 66 67 68 69 70 71 72
      ```
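When the core lists get long, the space-separated `cores` values in the config above can be generated rather than typed. The helper below is a hypothetical convenience, not part of SuperBench; it only reproduces the string format seen in the example config.

```python
def core_list(*ranges):
    """Expand inclusive (first, last) core ranges into a space-separated string."""
    cores = []
    for first, last in ranges:
        cores.extend(range(first, last + 1))  # +1 because `last` is inclusive
    return " ".join(str(core) for core in cores)


print(core_list((0, 8)))            # 0 1 2 3 4 5 6 7 8
print(core_list((64, 72)))          # 64 65 66 67 68 69 70 71 72
print(core_list((0, 8), (64, 72)))  # both ranges, as in the numa-spread entry
```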
      
      ---------
      Co-authored-by: dpower4 <dilipreddi@gmail.com>
  4. 18 Jun, 2025 1 commit
    • Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a
      WenqingLan1 authored
      Added the GPU Stream benchmark, which measures GPU memory bandwidth and
      efficiency for the double data type through various memory operations
      including copy, scale, add, and triad.
      - added documentation for `gpu-stream` detailing its introduction,
      metrics, and descriptions.
      - added unit tests for `gpu-stream`. Example output is in
      `superbenchmark/tests/data/gpu_stream.log`.
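For readers unfamiliar with the four operations, the sketch below shows the classic STREAM definitions and how bandwidth follows from bytes moved over elapsed time. It runs on the host in plain Python and is not the benchmark's GPU code; the function names and the array-count table are assumptions based on the standard STREAM convention (copy and scale touch two arrays, add and triad touch three).

```python
BYTES_PER_DOUBLE = 8
# Arrays read plus arrays written by each operation (standard STREAM counting).
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def run_kernel(name, a, b, c, scalar):
    """Apply one STREAM operation in place on the lists a, b, c."""
    if name == "copy":
        c[:] = a                                          # c = a
    elif name == "scale":
        b[:] = [scalar * x for x in c]                    # b = scalar * c
    elif name == "add":
        c[:] = [x + y for x, y in zip(a, b)]              # c = a + b
    elif name == "triad":
        a[:] = [y + scalar * z for y, z in zip(b, c)]     # a = b + scalar * c
    else:
        raise ValueError(f"unknown kernel: {name}")


def bandwidth_gb_per_s(name, num_elements, seconds):
    """Bandwidth in GB/s for one operation over arrays of doubles."""
    bytes_moved = ARRAYS_TOUCHED[name] * BYTES_PER_DOUBLE * num_elements
    return bytes_moved / seconds / 1e9


n = 1000
a, b, c = [1.0] * n, [2.0] * n, [0.0] * n
for kernel in ("copy", "scale", "add", "triad"):
    run_kernel(kernel, a, b, c, scalar=3.0)

print(a[0], b[0], c[0])  # 15.0 3.0 4.0
```

Efficiency is then typically the measured bandwidth divided by the device's theoretical peak; the exact formula used by the benchmark is described in its documentation.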
  5. 05 Feb, 2025 1 commit
  6. 28 Nov, 2024 1 commit
    • Benchmarks - Add LLaMA-2 Models (#668) · 249e21c1
      pdr authored
      Added the llama benchmark (training and inference), following the
      existing PyTorch model implementations such as gpt2 and lstm.
      
      - added llama fp8 unit test for better code coverage, to reduce memory
      required
      - updated transformers version >= 4.28.0 for LlamaConfig
      - set tokenizers version <= 0.20.3 to avoid 0.20.4 version
      [issues](https://github.com/huggingface/tokenizers/issues/1691) with
      py3.8
      - added llama2 to tensorrt
      - llama2 tests not added to test_tensorrt_inference_performance.py due
      to large memory requirement for worker gpu. tests validated separately
      on gh200
      
      ---------
      Co-authored-by: dpatlolla <dpatlolla@microsoft.com>
  7. 22 Nov, 2024 1 commit
  8. 08 Dec, 2023 1 commit
  9. 24 Mar, 2023 1 commit
  10. 21 Mar, 2023 1 commit
  11. 13 Feb, 2023 1 commit
  12. 11 Apr, 2022 1 commit
  13. 16 Mar, 2022 1 commit
    • Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324) · ff51a3ce
      rafsalas19 authored
      **Description**
      Modifications adding GPU-Burn to SuperBench.
      - added third party submodule
      - modified Makefile to make gpu-burn binary
      - added/modified microbenchmarks to add gpu-burn python scripts
      - modified default and azure_ndv4 configs to add gpu-burn
  14. 08 Feb, 2022 1 commit
  15. 21 Jan, 2022 1 commit
  16. 13 Dec, 2021 1 commit
  17. 10 Dec, 2021 1 commit
  18. 25 Nov, 2021 1 commit
  19. 12 Nov, 2021 1 commit
  20. 09 Nov, 2021 1 commit
  21. 30 Oct, 2021 1 commit
  22. 27 Oct, 2021 1 commit
  23. 22 Oct, 2021 1 commit
  24. 12 Oct, 2021 1 commit
  25. 30 Aug, 2021 2 commits
  26. 27 Aug, 2021 1 commit
  27. 30 Jul, 2021 1 commit
  28. 26 Jul, 2021 1 commit
  29. 23 Jul, 2021 2 commits
  30. 13 Jul, 2021 1 commit
  31. 02 Jun, 2021 1 commit
  32. 01 Jun, 2021 1 commit
  33. 31 May, 2021 1 commit
  34. 19 May, 2021 2 commits
  35. 26 Apr, 2021 1 commit
  36. 20 Apr, 2021 1 commit