1. 06 Jul, 2022 1 commit
    • Update dependencies and Dockerfile (#371) · 9f03d568
      Yifan Xiong authored
      Update dependencies and Dockerfile:
      * upgrade nccl-tests and rccl-tests to their latest versions to match
        the NCCL/RCCL versions
      * unify image tag names on DockerHub
      * remove verbose output in Dockerfile and fix some flags
  2. 05 Jul, 2022 1 commit
  3. 29 Jun, 2022 2 commits
    • Fix issues in ib loopback benchmark (#369) · 620192a2
      Yifan Xiong authored
      Fix several issues in ib loopback benchmark:
      * use `--report_gbits` and divide by 8 to get GB/s; previous results were
        MiB/s / 1000
      * use the ib_write_bw binary built in third_party instead of the one on the system path
      * update the metric names so that different HCA indices share the same metric name
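
      The unit change in the first item can be illustrated with a minimal sketch. It assumes decimal units (1 GB/s = 8 Gbit/s); the function names and the 200 Gbit/s figure are hypothetical and are not taken from the benchmark code.

      ```sh
      # Convert a bandwidth reported in Gbit/s (as with --report_gbits) to GB/s.
      # Assumes decimal units: 1 GB/s = 8 Gbit/s.
      gbits_to_gbytes() {
          awk -v gbits="$1" 'BEGIN { printf "%.2f\n", gbits / 8 }'
      }

      # The earlier calculation described in the commit: MiB/s divided by 1000.
      mib_div_1000() {
          awk -v mib="$1" 'BEGIN { printf "%.2f\n", mib / 1000 }'
      }

      gbits_to_gbytes 200      # hypothetical 200 Gbit/s link
      mib_div_1000 23841.86    # the same rate expressed in MiB/s
      ```

      For the same link rate the two calculations print 25.00 and 23.84, which shows why the earlier results were lower than the true GB/s value.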
    • Deployment - Refine error message when GPU is not detected (#368) · 8ef7163a
      Yifan Xiong authored
      Refine error message when GPU is not detected.
      
      Possible solutions if hardware exists and drivers are already installed:
      * NVIDIA GPUs:
        ```sh
        /sbin/modprobe nvidia-uvm
        D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
        mknod -m 666 /dev/nvidia-uvm c $D 0
        ```
      
      * AMD GPUs:
        ```sh
        modprobe amdgpu
        ```
  4. 24 Jun, 2022 2 commits
    • Fix incorrect ulimit config in Dockerfile (#364) · 325a7338
      Yifan Xiong authored
      Fix incorrect ulimit nofile config in Dockerfile.
      
      sh is used by default instead of bash, and its `echo` does not accept options, so the literal `-e` was written into /etc/security/limits.conf.
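
      One portable way to avoid this class of problem is to use `printf`, which behaves the same in sh and bash. The sketch below is only an illustration and not necessarily the change made in the commit: the nofile values are hypothetical, and a temporary file stands in for /etc/security/limits.conf.

      ```sh
      # Hypothetical stand-in for /etc/security/limits.conf.
      limits_file="$(mktemp)"

      # printf handles the newlines itself, so no -e flag is needed and
      # nothing unexpected is written in shells whose echo lacks -e.
      # The limit values below are placeholders for illustration.
      printf '%s\n' \
          '* soft nofile 1048576' \
          '* hard nofile 1048576' >> "$limits_file"

      cat "$limits_file"
      ```

      Each argument after the format string becomes one line of the file, so the result does not depend on how the shell's `echo` treats flags or escapes.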
    • Support multiple IB/GPU in ib validation (#363) · bfaa1c83
      Yifan Xiong authored
      **Description**
      
      Support running multiple IB/GPU devices simultaneously in the ib validation benchmark.
      
      **Major Revisions**
      - Revise ib_validation_performance.cc so that multiple processes per node can launch multiple perftest commands simultaneously. For each node pair in the config, the configured number of processes per node runs in parallel.
      - Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
      - Fix env issues in Dockerfile for end-to-end test.
      - Update ib-traffic configuration examples in config files.
      - Update unit tests and docs accordingly.
      
      Closes #326.
  5. 19 Jun, 2022 2 commits
  6. 15 Jun, 2022 1 commit
    • Fix cmake and build issues (#360) · 60a3c743
      Yifan Xiong authored
      **Description**
      
      Fix cmake and build issues.
      
      **Major Revision**
      
      * Remove unnecessary boost build
      * Remove user-agent for mlc
      * Remove `-j` from the third-party build so that each project is built in sequence
      * Fix ansible collections installation path
  7. 14 Jun, 2022 1 commit
    • Support `sb run` on host directly without Docker (#358) · a4937e95
      Yifan Xiong authored
      **Description**
      
      Support `sb run` on host directly without Docker
      
      **Major Revisions**
      - Add `--no-docker` argument for `sb run`.
      - Run on host directly if `--no-docker` is specified.
      - Update docs and tests correspondingly.
  8. 06 Jun, 2022 1 commit
  9. 02 Jun, 2022 2 commits
  10. 01 Jun, 2022 1 commit
    • Analyzer - Fix bugs in data diagnosis (#355) · 54da021b
      user4543 authored
      **Description**
      Fix bugs in data diagnosis.
      
      **Major Revision**
      - Add support for getting the baseline of metrics that use custom benchmark naming with ':', such as 'nccl-bw:default/allreduce_8_bw:0'
      - Save raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True
      - Fix a bug where the wrong column index was used when applying formatting (red color and percentile) in the Excel output
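
      To show what a name with custom benchmark naming contains, here is a minimal sketch that splits the example from the first item into its parts. It assumes the layout `benchmark[:annotation]/metric[:rank]`; the function and field names are hypothetical and this is not the analyzer's actual implementation.

      ```sh
      # Split a metric name of the assumed form benchmark[:annotation]/metric[:rank].
      parse_metric() {
          full="$1"
          bench_part="${full%%/*}"     # text before the first '/'
          metric_part="${full#*/}"     # text after the first '/'
          benchmark="${bench_part%%:*}"
          case "$bench_part" in
              *:*) annotation="${bench_part#*:}" ;;
              *)   annotation="" ;;
          esac
          metric="${metric_part%%:*}"
          case "$metric_part" in
              *:*) rank="${metric_part##*:}" ;;
              *)   rank="" ;;
          esac
          printf 'benchmark=%s annotation=%s metric=%s rank=%s\n' \
              "$benchmark" "$annotation" "$metric" "$rank"
      }

      parse_metric 'nccl-bw:default/allreduce_8_bw:0'
      ```

      Under this assumed layout, a baseline lookup has to treat the ':' after the benchmark name separately from the ':' before the trailing index, which is the case the first item addresses.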
  11. 31 May, 2022 1 commit
  12. 27 May, 2022 1 commit
  13. 25 May, 2022 1 commit
  14. 29 Apr, 2022 1 commit
    • Release - SuperBench v0.5.0 (#350) · 6681c720
      Yifan Xiong authored
      
      
      **Description**
      
      Cherry-pick bug fixes from v0.5.0 to main.
      
      **Major Revisions**
      
      * Bug - Force to fix ort version as '1.10.0' (#343)
      * Bug - Support no matching rules and unify the output name in result_summary (#345)
      * Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
      * Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
      * Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
      * Docs - Upgrade version and release note (#348)
      Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
  15. 20 Apr, 2022 1 commit
  16. 15 Apr, 2022 1 commit
  17. 11 Apr, 2022 2 commits
  18. 10 Apr, 2022 1 commit
  19. 08 Apr, 2022 2 commits
  20. 01 Apr, 2022 1 commit
  21. 31 Mar, 2022 1 commit
  22. 24 Mar, 2022 1 commit
  23. 22 Mar, 2022 1 commit
  24. 21 Mar, 2022 1 commit
  25. 17 Mar, 2022 1 commit
  26. 16 Mar, 2022 1 commit
    • Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324) · ff51a3ce
      rafsalas19 authored
      **Description**
      Modifications adding GPU-Burn to SuperBench.
      - added third party submodule
      - modified Makefile to make gpu-burn binary
      - added/modified microbenchmarks to add gpu-burn python scripts
      - modified default and azure_ndv4 configs to add gpu-burn
  27. 15 Mar, 2022 2 commits
  28. 09 Mar, 2022 1 commit
  29. 07 Mar, 2022 2 commits
  30. 06 Mar, 2022 1 commit
  31. 28 Feb, 2022 2 commits