1. 21 Apr, 2026 1 commit
    • Runner: Add local numactl GPU affinity support (#6) · 0993db75
      one authored
      - Add `numactl` support for local runner modes, including `cpunodebind`, `membind`, and `physcpubind`.
      - Add `gpu_affinity` resolution through `sb node topo --get gpu-numa-affinity --gpu-id`.
      - Add `sb node topo` support for GPU NUMA topology queries.
      - Update BW1000 config to use the new local `numactl` semantics.
      - Document the new `numactl` mode fields and limitations.
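The new mode fields map directly onto `numactl` flags. As a rough illustration of that mapping (this is not SuperBench code; the `build_numactl_prefix` helper and the dict field names are hypothetical, chosen to mirror the flags named above):

```python
# Hypothetical sketch: translate local-runner-mode config fields into a
# numactl command prefix. Field names mirror the commit's new mode fields
# (cpunodebind, membind, physcpubind); the helper itself is illustrative.
def build_numactl_prefix(mode):
    args = ['numactl']
    # --cpunodebind: restrict execution to CPUs of the given NUMA node(s)
    if 'cpunodebind' in mode:
        args.append('--cpunodebind={}'.format(mode['cpunodebind']))
    # --membind: restrict memory allocation to the given NUMA node(s)
    if 'membind' in mode:
        args.append('--membind={}'.format(mode['membind']))
    # --physcpubind: pin to explicit physical CPU ids
    if 'physcpubind' in mode:
        args.append('--physcpubind={}'.format(mode['physcpubind']))
    return args

prefix = build_numactl_prefix({'cpunodebind': 0, 'membind': 0})
print(' '.join(prefix))  # numactl --cpunodebind=0 --membind=0
```

In practice the `gpu_affinity` value resolved via `sb node topo --get gpu-numa-affinity --gpu-id` would supply the node ids used here.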
  2. 20 Apr, 2026 1 commit
  3. 18 Apr, 2026 3 commits
    • Docs: Update gpu-hpcg introduction · 006f50c0
      one authored
    • Docs: Simplify gpu-hpcg metric list · f076f38f
      one authored
    • Benchmark: Model benchmark - deterministic training support (#731) (#2) · 47d4a79d
      one authored

      Adds an opt-in deterministic training mode to SuperBench's PyTorch
      model benchmarks. When enabled via --enable-determinism, PyTorch
      deterministic algorithms are enforced, and per-step numerical
      fingerprints (loss, activation means) are recorded as metrics. These
      can be compared across runs using the existing sb result diagnosis
      pipeline to verify bit-exact reproducibility, which is useful for
      hardware validation and platform comparison.

      Flags added:

      - --enable-determinism
      - --check-frequency: number of steps after which the metrics are
        recorded
      - --deterministic-seed
      
      Changes:

      - Updated pytorch_base.py to handle deterministic settings and logging.
      - Added a new example script: pytorch_deterministic_example.py
      - Added a test file, test_pytorch_determinism_all.py, to verify
        everything works as expected.
      
      Usage:

      Step 1: Run with --enable-determinism; the necessary metrics are
      recorded in the results-summary.jsonl file.
      Step 2: Generate the baseline file from the Run 1 results using
      sb result generate-baseline.
      Step 3: Run again with --enable-determinism on a different machine (or
      the same machine); the metrics are again recorded in the
      results-summary.jsonl file.
      Step 4: Run diagnosis on the results from the two runs using the
      sb result diagnosis command.
      
      Note:
      1. Make sure all parameters are constant between the two runs.
      2. Running the diagnosis command requires the rules.yaml file.
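The diagnosis step boils down to checking that every recorded fingerprint matches exactly between the two runs. A plain-Python sketch of that idea (this is illustrative, not the actual sb result diagnosis implementation; `compare_fingerprints` and the data shape are hypothetical):

```python
# Illustrative sketch of the fingerprint-comparison idea: each run records
# per-step metrics (loss, activation means), and two runs are bit-exact
# reproducible only if every recorded value matches exactly.
def compare_fingerprints(run_a, run_b):
    """Return the list of (step, metric) pairs that differ between runs."""
    mismatches = []
    for step, metrics_a in run_a.items():
        metrics_b = run_b.get(step, {})
        for name, value_a in metrics_a.items():
            # Bit-exact check: any difference, however small, is a mismatch.
            if metrics_b.get(name) != value_a:
                mismatches.append((step, name))
    return mismatches

run1 = {0: {'loss': 2.3125, 'act_mean': 0.015625}, 10: {'loss': 1.75}}
run2 = {0: {'loss': 2.3125, 'act_mean': 0.015625}, 10: {'loss': 1.75}}
print(compare_fingerprints(run1, run2))  # [] -> runs are bit-exact
```

An empty mismatch list is the bit-exact-reproducibility signal; any entry points at the first step where the two platforms diverged numerically.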
      
      ---------
      Co-authored-by: Aishwarya Tonpe <aishwarya.tonpe25@gmail.com>
      Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>
  4. 17 Apr, 2026 1 commit
  5. 02 Apr, 2026 1 commit
  6. 01 Apr, 2026 2 commits
  7. 19 Mar, 2026 1 commit
  8. 12 Aug, 2025 1 commit
  9. 30 Jun, 2025 1 commit
  10. 25 Jun, 2025 1 commit
  11. 18 Jun, 2025 1 commit
    • Benchmarks - Add GPU Stream Micro Benchmark (#697) · 4eddd50a
      WenqingLan1 authored
      Added GPU Stream benchmark, which measures GPU memory bandwidth and
      efficiency for the double datatype through various memory operations,
      including copy, scale, add, and triad.
      - added documentation for `gpu-stream` detailing its introduction,
        metrics, and descriptions.
      - added unit tests for `gpu-stream`. Example output is in
        `superbenchmark/tests/data/gpu_stream.log`.
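For reference, the four timed operations follow the classic STREAM kernel definitions. A plain-Python sketch of what each kernel computes (illustrative only; the real benchmark runs these over large double-precision arrays on the GPU):

```python
# The four STREAM-style kernels, shown over small Python lists for clarity.
def stream_kernels(a, b, c, scalar=3.0):
    copy = [x for x in a]                           # copy:  c[i] = a[i]
    scale = [scalar * x for x in c]                 # scale: b[i] = s * c[i]
    add = [x + y for x, y in zip(a, b)]             # add:   c[i] = a[i] + b[i]
    triad = [x + scalar * y for x, y in zip(b, c)]  # triad: a[i] = b[i] + s * c[i]
    return copy, scale, add, triad

a, b, c = [1.0, 2.0], [10.0, 20.0], [100.0, 200.0]
copy, scale, add, triad = stream_kernels(a, b, c)
print(triad)  # [310.0, 620.0]
```

Bandwidth is derived from the bytes each kernel moves per element (copy and scale touch two arrays; add and triad touch three), divided by the measured kernel time.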
  12. 28 Nov, 2024 1 commit
    • Benchmarks - Add LLaMA-2 Models (#668) · 249e21c1
      pdr authored
      Added llama benchmark for training and inference, in accordance with
      the existing pytorch model implementations such as gpt2, lstm, etc.

      - added llama fp8 unit test for better code coverage and to reduce the
        memory required
      - updated transformers version >= 4.28.0 for LlamaConfig
      - set tokenizers version <= 0.20.3 to avoid the 0.20.4
        [issues](https://github.com/huggingface/tokenizers/issues/1691) with
        py3.8
      - added llama2 to tensorrt
      - llama2 tests not added to test_tensorrt_inference_performance.py due
        to the large memory requirement for the worker GPU; tests validated
        separately on GH200
      
      ---------
      Co-authored-by: dpatlolla <dpatlolla@microsoft.com>
  13. 27 Nov, 2024 1 commit
  14. 22 Nov, 2024 1 commit
  15. 10 Oct, 2024 1 commit
  16. 25 Jul, 2024 1 commit
  17. 08 Jan, 2024 1 commit
    • Release - SuperBench v0.10.0 (#607) · 2c88db90
      Yifan Xiong authored

      **Description**
      
      Cherry-pick bug fixes from v0.10.0 to main.
      
      **Major Revisions**
      
      * Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
      * Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
      * Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
      * Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
      * Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
      * CI/CD - Add ndv5 topo file #597
      * Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
      * Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
      * Dockerfile - Bug fix for rocm docker build and deploy #598
      * Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
      * Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
      * Monitor - Upgrade pyrsmi to amdsmi python library. #601
      * Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
      * Dockerfile - Add rocm6.0 dockerfile #602
      * Bug Fix - Bug fix for latest megatron-lm benchmark #600
      * Docs - Upgrade version and release note #606
      Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
      Co-authored-by: Yang Wang <yangwang1@microsoft.com>
      Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
      Co-authored-by: guoshzhao <guzhao@microsoft.com>
  18. 10 Dec, 2023 1 commit
  19. 09 Dec, 2023 1 commit
  20. 08 Dec, 2023 1 commit
  21. 07 Dec, 2023 1 commit
  22. 04 Dec, 2023 1 commit
  23. 22 Nov, 2023 1 commit
  24. 27 Jul, 2023 1 commit
    • Release - SuperBench v0.9.0 (#558) · e1df877b
      Yuting Jiang authored
      **Description**
      Cherry-pick bug fixes from v0.9.0 to main.
      
      **Major Revision**
      - CI/CD: pipeline - clean more disk space to fix the rocm building
        image pipeline (#555)
      - Benchmarks: bug fix - use absolute path for input file in
        DirectXEncodingLatency (#554)
      - CI/CD - add push win docker image on release branch in pipeline (#552)
      - Docs - Upgrade version and release note (#557)
  25. 30 Jun, 2023 1 commit
  26. 29 Jun, 2023 1 commit
    • Tools - Add runner for sys info and update docs (#532) · ed027e4c
      Yuting Jiang authored
      **Description**
      Add runner for sys info to automatically collect on multiple nodes and
      update related docs.
      
      **Major Revision**
      - add runner for sys info, which checks docker status, runs `sb node
        info` in each node's docker, and fetches results from all nodes

      **Minor Revision**
      - update cli and system-info doc
      - update `sb node info` to save output info to output-dir/sys-info.json
  27. 04 May, 2023 1 commit
  28. 28 Apr, 2023 1 commit
  29. 14 Apr, 2023 1 commit
    • Release - SuperBench v0.8.0 (#517) · 51761b3a
      Yifan Xiong authored

      **Description**
      
      Cherry-pick bug fixes from v0.8.0 to main.
      
      **Major Revisions**
      
      * Monitor - Fix the cgroup version checking logic (#502)
      * Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
      * Fix wrong torch usage in communication wrapper for Distributed
      Inference Benchmark (#505)
      * Analyzer: Fix bug in python3.8 due to pandas api change (#504)
      * Bug - Fix bug to get metric from cmd when error happens (#506)
      * Monitor - Collect realtime GPU power when benchmarking (#507)
      * Add num_workers argument in model benchmark (#511)
      * Remove unreachable condition when write host list (#512)
      * Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
      * Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
      * Docs - Upgrade version and release note (#508)
      Co-authored-by: guoshzhao <guzhao@microsoft.com>
      Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
      Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
  30. 24 Mar, 2023 1 commit
  31. 22 Mar, 2023 1 commit
  32. 21 Mar, 2023 1 commit
  33. 13 Feb, 2023 1 commit
  34. 28 Jan, 2023 1 commit
  35. 04 Jan, 2023 2 commits
  36. 03 Jan, 2023 1 commit