1. 25 Jul, 2022 1 commit
  2. 20 Jul, 2022 1 commit
      Fix port conflict in ib loopback (#375) · 352ae0c9
      Yifan Xiong authored
      Fix a potential port conflict caused by a race condition between time-of-check
      and time-of-use, by keeping the port bound throughout.
      
      Modify the function to resolve flake8 C901 (function too complex) while keeping the logic the same.
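A minimal sketch of the general time-of-check/time-of-use pattern in Python, not the SuperBench code itself: `probe_free_port` and `reserve_port` are hypothetical helpers. The first finds a free port and then releases it, leaving a window in which another process can take it; the second keeps the socket bound and hands it to the caller, so the port stays reserved for as long as it is needed.

```python
import socket


def probe_free_port(host: str = "127.0.0.1") -> int:
    """Anti-pattern: find a free port, then release it before it is used.

    The probe socket is closed when this returns, so another process can
    claim the port before the caller binds it again.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))
        return probe.getsockname()[1]


def reserve_port(host: str = "127.0.0.1") -> socket.socket:
    """Bind an OS-assigned port and return the still-bound socket.

    The caller keeps this socket open while the port is in use, so there
    is no gap between checking the port and using it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the kernel for any unused port
    return sock
```

Usage: call `reserve_port()`, read the port with `getsockname()[1]`, and then `listen()` on that same socket instead of closing it and binding again.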
  3. 09 Jul, 2022 1 commit
      Fix issues in ib validation benchmark (#370) · b2875179
      Yifan Xiong authored
      Fix several issues in the ib validation benchmark:
      * continue running when a timeout occurs midway, instead of aborting the whole MPI process
      * make the timeout parameter configurable, with a default of 120 seconds
      * avoid mixing stdio and iostream when printing to stdout
      * set the default message size to 8M, which saturates IB in most cases
      * fix the hostfile path issue so that the hostfile can be found automatically in different cases
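The continue-on-timeout policy can be sketched in Python as below. This is only an illustration: the actual benchmark is C++ driven by MPI, and `run_all` is a hypothetical helper. Each command gets its own timeout (defaulting to the 120 seconds mentioned above); a command that times out is recorded as `None` and the loop moves on to the next one instead of aborting.

```python
import subprocess
import sys
from typing import List, Optional

DEFAULT_TIMEOUT = 120  # seconds, the default described in the commit


def run_all(commands: List[List[str]], timeout: float = DEFAULT_TIMEOUT) -> List[Optional[int]]:
    """Run each command in turn and return its exit code.

    A command that exceeds `timeout` is killed by subprocess.run and
    recorded as None; the remaining commands still run.
    """
    results: List[Optional[int]] = []
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            results.append(proc.returncode)
        except subprocess.TimeoutExpired:
            print(f"timeout after {timeout}s: {' '.join(cmd)}", file=sys.stderr)
            results.append(None)
    return results
```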
  4. 29 Jun, 2022 1 commit
      Fix issues in ib loopback benchmark (#369) · 620192a2
      Yifan Xiong authored
      Fix several issues in the ib loopback benchmark:
      * use `--report_gbits` and divide by 8 to get GB/s; previous results were
        MiB/s divided by 1000
      * use the ib_write_bw binary built in third_party instead of the one in the system path
      * update metric names so that different HCA indices share the same metric name
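The unit change in the first item is simple arithmetic, sketched below under the assumption that "GB" means 10^9 bytes. `gbits_to_gbytes` and `legacy_value` are hypothetical names used only to compare the new conversion with the old "MiB/s divided by 1000" value.

```python
MIB = 1024 ** 2  # bytes in one MiB


def gbits_to_gbytes(gbits_per_s: float) -> float:
    """Convert Gbit/s (as reported with --report_gbits) to GB/s (10^9 bytes/s)."""
    return gbits_per_s / 8


def legacy_value(gbits_per_s: float) -> float:
    """Reproduce the previous metric for the same link: MiB/s divided by 1000."""
    bytes_per_s = gbits_per_s * 1e9 / 8
    return bytes_per_s / MIB / 1000
```

For a link reporting 100 Gbit/s, the new value is 12.5 GB/s, while the old formula yields about 11.92. The ratio is always 10^6 / 2^20 ≈ 0.954, so earlier results were about 4.6% lower than true GB/s.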
  5. 24 Jun, 2022 1 commit
      Support multiple IB/GPU in ib validation (#363) · bfaa1c83
      Yifan Xiong authored
      **Description**
      
      Support running multiple IB/GPU devices simultaneously in the ib validation benchmark.
      
      **Major Revisions**
      - Revise ib_validation_performance.cc so that multiple processes per node can be used to launch multiple perftest commands simultaneously. For each node pair in the config, the configured number of processes per node run in parallel.
      - Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
      - Fix environment issues in the Dockerfile for the end-to-end test.
      - Update ib-traffic configuration examples in config files.
      - Update unit tests and docs accordingly.
      
      Closes #326.
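One way to picture "multiple processes per node" for a single node pair is sketched below. This is not the actual ib_validation_performance.cc logic: `build_pair_commands` is hypothetical, the base port 18515 is assumed to be perftest's default, "8M" is assumed to mean 8 MiB, and GPU/NUMA binding is omitted. Local process i on each node uses the i-th NIC and its own TCP port, so the parallel ib_write_bw instances do not collide.

```python
from typing import Dict, List


def build_pair_commands(
    server: str,
    client: str,
    devices: List[str],
    base_port: int = 18515,
    msg_size: int = 8 * 1024 * 1024,
) -> List[Dict[str, str]]:
    """Build one server/client ib_write_bw command pair per local process.

    Process i on each node uses devices[i] and TCP port base_port + i.
    """
    commands = []
    for local_rank, device in enumerate(devices):
        port = base_port + local_rank
        base = f"ib_write_bw -d {device} -p {port} -s {msg_size} --report_gbits"
        commands.append({
            "server_host": server,
            "server_cmd": base,
            # The client passes the server hostname as its final argument.
            "client_host": client,
            "client_cmd": f"{base} {server}",
        })
    return commands
```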
  6. 15 Jun, 2022 1 commit
      Fix cmake and build issues (#360) · 60a3c743
      Yifan Xiong authored
      **Description**
      
      Fix cmake and build issues.
      
      **Major Revisions**
      
      * Remove unnecessary boost build
      * Remove user-agent for mlc
      * Remove `-j` from the third-party build so that each project is built in sequence
      * Fix ansible collections installation path
  7. 14 Jun, 2022 1 commit
      Support `sb run` on host directly without Docker (#358) · a4937e95
      Yifan Xiong authored
      **Description**
      
      Support `sb run` on host directly without Docker
      
      **Major Revisions**
      - Add `--no-docker` argument for `sb run`.
      - Run directly on the host if `--no-docker` is specified.
      - Update docs and tests correspondingly.
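A sketch of how a `--no-docker` switch can select between host and container execution. It uses argparse purely for illustration; the real `sb` CLI may be built differently. `build_parser`, `wrap_command`, and the container name `sb-workspace` are hypothetical.

```python
import argparse
import shlex
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sb run")
    parser.add_argument(
        "--no-docker",
        action="store_true",
        help="Run benchmarks directly on the host instead of in a container.",
    )
    return parser


def wrap_command(cmd: str, no_docker: bool, container: str = "sb-workspace") -> List[str]:
    """Return the argv that executes `cmd` on the host or inside a container.

    `container` is a hypothetical container name used only for illustration.
    """
    if no_docker:
        return shlex.split(cmd)  # run the command as-is on the host
    return ["docker", "exec", container, "bash", "-c", cmd]
```

Keeping the mode decision in one place means the rest of the runner does not need to know whether it is running inside Docker.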
  8. 29 Apr, 2022 1 commit
      Release - SuperBench v0.5.0 (#350) · 6681c720
      Yifan Xiong authored
      
      
      **Description**
      
      Cherry-pick bug fixes from v0.5.0 to main.
      
      **Major Revisions**
      
      * Bug - Force to fix ort version as '1.10.0' (#343)
      * Bug - Support no matching rules and unify the output name in result_summary (#345)
      * Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
      * Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
      * Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
      * Docs - Upgrade version and release note (#348)
      Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
  9. 11 Apr, 2022 1 commit
  10. 01 Apr, 2022 1 commit
  11. 22 Mar, 2022 1 commit
  12. 17 Mar, 2022 1 commit
  13. 16 Mar, 2022 1 commit
      Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324) · ff51a3ce
      rafsalas19 authored
      **Description**
      Add GPU-Burn to SuperBench:
      - added the gpu-burn third-party submodule
      - modified the Makefile to build the gpu-burn binary
      - added/modified microbenchmarks to include the gpu-burn Python scripts
      - modified the default and azure_ndv4 configs to add gpu-burn
  14. 06 Mar, 2022 1 commit
  15. 24 Feb, 2022 1 commit
  16. 22 Feb, 2022 1 commit
  17. 10 Feb, 2022 1 commit
  18. 09 Feb, 2022 1 commit
  19. 08 Feb, 2022 1 commit
  20. 07 Feb, 2022 1 commit
      Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298) · 85389055
      Ziyue Yang authored
      **Description**
      This commit makes the following changes to reduce result variance in the gpu_copy benchmark:
      1) Add a warmup phase to the gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
      2) Use CUDA events instead of CPU timestamps for timing;
      3) Make data checking optional, since enabling it is not recommended in performance tests;
      4) Enlarge the message size in the performance benchmark.
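The warmup idea from item 1 can be sketched generically in Python. `time_with_warmup` is a hypothetical helper, and it uses `time.perf_counter` only as a stand-in. For GPU work, a CPU timer around asynchronous launches mostly measures launch cost, which is why the commit switches to CUDA events (`cudaEventRecord` / `cudaEventElapsedTime` in the CUDA runtime).

```python
import time
from typing import Callable, List


def time_with_warmup(fn: Callable[[], None], warmup: int = 5, iters: int = 20) -> List[float]:
    """Return per-iteration durations in seconds, excluding warmup runs."""
    for _ in range(warmup):
        fn()  # absorb one-time costs (lazy init, cold caches) before measuring
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples
```

Reporting a robust statistic such as the median of `samples`, rather than a single run, further reduces variance.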
  21. 29 Jan, 2022 2 commits
  22. 28 Jan, 2022 1 commit
  23. 24 Jan, 2022 1 commit
  24. 21 Jan, 2022 1 commit
  25. 19 Jan, 2022 1 commit
  26. 18 Jan, 2022 1 commit
      CLI - Add command sb benchmark [list,list-parameters] (#279) · f7ffc545
      Yifan Xiong authored
      __Description__
      
      Add commands `sb benchmark list` and `sb benchmark list-parameters` to support listing benchmarks and all optional parameters for benchmarks.
      
      <details>
      <summary>Examples</summary>
      <pre>
      $ sb benchmark list -n [a-z]+-bw -o table
      Result
      --------
      mem-bw
      nccl-bw
      rccl-bw
      </pre>
      <pre>
      $ sb benchmark list-parameters -n mem-bw
      === mem-bw ===
      optional arguments:
        --bin_dir str         Specify the directory of the benchmark binary.
        --duration int        The elapsed time of benchmark in seconds.
        --mem_type str [str ...]
                              Memory types to benchmark. E.g. htod dtoh dtod.
        --memory str          Memory argument for bandwidthtest. E.g. pinned unpinned.
        --run_count int       The run count of benchmark.
        --shmoo_mode          Enable shmoo mode for bandwidthtest.
      default values:
      {'bin_dir': None,
       'duration': 0,
       'mem_type': ['htod', 'dtoh'],
       'memory': 'pinned',
       'run_count': 1}
      </pre>
      </details>
      
      __Major Revisions__
      * Add `sb benchmark list` to list benchmarks matching given name.
      * Add `sb benchmark list-parameters` to list parameters for benchmarks which match given name.
      
      __Minor Revisions__
      * Sort arguments in the formatted argparse help text.
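The two commands in the examples above can be approximated in Python as follows. This is only a sketch: whether `sb` applies the pattern with a full or partial regex match is an assumption (full match is used here), and `REGISTRY`, `match_benchmarks`, `default_values`, and the `mem_bw` parser are hypothetical stand-ins that mirror part of the mem-bw help output.

```python
import argparse
import re
from typing import Dict, Iterable, List


def match_benchmarks(names: Iterable[str], pattern: str) -> List[str]:
    """Return benchmark names that fully match the regex, sorted."""
    regex = re.compile(pattern)
    return sorted(n for n in names if regex.fullmatch(n))


def default_values(parser: argparse.ArgumentParser) -> Dict[str, object]:
    """Parse an empty argument list to collect every default value."""
    return vars(parser.parse_args([]))


# Hypothetical registry contents, for illustration only.
REGISTRY = ["mem-bw", "nccl-bw", "rccl-bw", "kernel-launch", "gpu-burn", "ib-loopback"]

# Hypothetical parser mirroring part of the mem-bw parameters shown above.
mem_bw = argparse.ArgumentParser(prog="mem-bw")
mem_bw.add_argument("--bin_dir", type=str, default=None)
mem_bw.add_argument("--duration", type=int, default=0)
mem_bw.add_argument("--mem_type", type=str, nargs="+", default=["htod", "dtoh"])
mem_bw.add_argument("--memory", type=str, default="pinned")
mem_bw.add_argument("--run_count", type=int, default=1)
```

With this registry, `match_benchmarks(REGISTRY, "[a-z]+-bw")` selects the same three names as the `sb benchmark list` example.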
  27. 30 Dec, 2021 1 commit
      Release - SuperBench v0.4.0 (#278) · ff563b66
      Yifan Xiong authored
      
      
      __Description__
      
      Cherry-pick bug fixes from v0.4.0 to main.
      
      __Major Revisions__
      
      * Bug - Fix issues for Ansible and benchmarks (#267)
      * Tests - Refine test cases for microbenchmark (#268)
      * Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
      * Benchmarks: Fix Bug - Fix fio build issue (#272)
      * Docs - Unify metric and add doc for cublas and cudnn functions (#271)
      * Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
      * Bug - Fix bug of detecting if gpu_index is none (#275)
      * Bug - Fix bugs in data diagnosis (#273)
      * Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
      * Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
      * Docs - Upgrade version and release note (#277)
      Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
  28. 13 Dec, 2021 3 commits
  29. 10 Dec, 2021 3 commits
  30. 09 Dec, 2021 1 commit
  31. 07 Dec, 2021 1 commit
  32. 02 Dec, 2021 1 commit
  33. 15 Nov, 2021 1 commit
      Benchmarks: Add Feature - Extend the device manager utility to support more functions. (#239) · cc70f9c1
      guoshzhao authored
      **Description**
      Rename the `nvidia_helper` utility to the `device_manager` module and support more functions:
      ```python
      device_manager.get_device_count()
      device_manager.get_device_utilization(idx)
      device_manager.get_device_temperature(idx)
      device_manager.get_device_power_limit(idx)
      device_manager.get_device_memory(idx)
      device_manager.get_device_row_remapped_info(idx)
      device_manager.get_device_ecc_error(idx)
      ```
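Callers can use these getters without depending on how they are implemented. The sketch below assumes only the method names listed above. Their return types and units are not stated in the source, so `snapshot` and `FakeDeviceManager` are hypothetical, and the fake returns fixed placeholder values (not real readings) so the code runs without a GPU.

```python
from typing import Any, Dict, List


def snapshot(manager: Any) -> List[Dict[str, Any]]:
    """Collect one reading per device through the device_manager-style getters."""
    return [
        {
            "index": idx,
            "utilization": manager.get_device_utilization(idx),
            "temperature": manager.get_device_temperature(idx),
            "power_limit": manager.get_device_power_limit(idx),
        }
        for idx in range(manager.get_device_count())
    ]


class FakeDeviceManager:
    """Hypothetical test double returning fixed placeholder values."""

    def __init__(self, count: int) -> None:
        self.count = count

    def get_device_count(self) -> int:
        return self.count

    def get_device_utilization(self, idx: int) -> int:
        return 0

    def get_device_temperature(self, idx: int) -> int:
        return 30 + idx

    def get_device_power_limit(self, idx: int) -> int:
        return 400
```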
  34. 12 Nov, 2021 1 commit
  35. 09 Nov, 2021 1 commit