1. 17 Apr, 2026 1 commit
  2. 02 Apr, 2026 2 commits
  3. 01 Apr, 2026 1 commit
  4. 20 Aug, 2024 1 commit
  5. 08 Aug, 2023 1 commit
  6. 03 Jan, 2023 1 commit
  7. 29 Nov, 2022 1 commit
    • Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430) · e4eeda0a
      Yang Wang authored
      * add mpi-parallel mode
      
      * update according to comments
      
      * fix and update doc
      
      * update
      
      * merge into 'mpi' mode
      
      * update according to comments
      
      * fix testcases
      
      * fix ansible
      
      * regard pattern as field
      
      * update
      
      * fix flake8 version
      
      * add flake8 range
      
      * remove map-by from host config
      
      * update comments
  8. 01 Nov, 2022 1 commit
    • CLI - Add non-zero return code for `sb [deploy,run]` (#425) · 1b86503d
      Yifan Xiong authored
      Add a non-zero return code for the `sb deploy` and `sb run` commands when
      there are Ansible failures in the control plane.
      The return code is set to the count of failures.
      
      For failures caused by benchmarks, the return code is still set per benchmark
      in the results JSON file.
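      The exit-code convention can be sketched as follows (a minimal illustration, not the actual SuperBench implementation; the helper name is made up):

```python
def exit_code_from_failures(failure_count: int) -> int:
    """Map a count of Ansible host failures to a process exit code.

    0 means success; a non-zero code equals the failure count,
    clamped to 255, the largest valid POSIX exit status.
    """
    return min(max(failure_count, 0), 255)
```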
  9. 08 Jul, 2022 1 commit
    • Support node_num=1 in mpi mode (#372) · e00a8180
      Yifan Xiong authored
      Support `node_num: 1` in mpi mode, so that mpi benchmarks can run on
      both a single node and all nodes from one config by changing `node_num`.
      Update docs and add test case accordingly.
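      An illustrative config fragment for this pattern (the exact field layout is an assumption based on the description above, not copied from a real SuperBench config):

```yaml
# Illustrative only: switch between a single-node and an all-node run
# by editing node_num, keeping the rest of the config unchanged.
modes:
  - name: mpi
    node_num: 1   # run on one node; change this value to span all nodes
```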
  10. 14 Jun, 2022 1 commit
    • Support `sb run` on host directly without Docker (#358) · a4937e95
      Yifan Xiong authored
      **Description**
      
      Support `sb run` on host directly without Docker
      
      **Major Revisions**
      - Add `--no-docker` argument for `sb run`.
      - Run on the host directly if `--no-docker` is specified.
      - Update docs and tests correspondingly.
  11. 29 Jan, 2022 1 commit
  12. 28 Jan, 2022 2 commits
    • Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287) · d03d110f
      guoshzhao authored
      **Description**
      Sync the E2E training results among all workers for each step.
      
      **Major Revision**
      - Sync (do allreduce max) the E2E training results among all workers.
      - Avoid appending ':0' to the metric name when only one rank produces output.
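      The allreduce-max sync can be illustrated with plain Python standing in for the distributed collective (a sketch of the semantics only; a real model benchmark would use a framework collective such as `torch.distributed.all_reduce`):

```python
def allreduce_max(per_worker_step_times):
    """Emulate an allreduce(MAX) over per-step E2E results.

    Each worker holds a list of step times; after the sync every
    worker sees, for each step, the maximum across all workers.
    """
    return [max(step) for step in zip(*per_worker_step_times)]
```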
    • Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288) · d877ca23
      guoshzhao authored
      **Description**
      Add timeout feature for each benchmark.
      
      **Major Revision**
      - Add a `timeout` config for each benchmark. In the current config files, only kernel-launch sets a timeout as an example; other benchmarks can be configured in the future.
      - Set the timeout config for `ansible_runner.run()`. On timeout, the runner gets return code 254:
         [ansible.py:80][WARNING] Run failed, return code 254.
      - Use the `timeout` command to terminate the client process.
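      A sketch of the timeout behavior in plain Python (the wrapper and its name are illustrative, not the actual runner code; 254 mirrors the return code quoted above):

```python
import subprocess

TIMEOUT_RC = 254  # mirrors the return code the runner reports on timeout

def run_with_timeout(cmd, timeout_sec):
    """Run a benchmark command, killing it if it exceeds timeout_sec.

    Returns the command's own exit code on normal completion,
    or TIMEOUT_RC if the command had to be terminated.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_sec).returncode
    except subprocess.TimeoutExpired:
        return TIMEOUT_RC
```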
  13. 30 Dec, 2021 1 commit
    • Release - SuperBench v0.4.0 (#278) · ff563b66
      Yifan Xiong authored
      __Description__
      
      Cherry-pick bug fixes from v0.4.0 to main.
      
      __Major Revisions__
      
      * Bug - Fix issues for Ansible and benchmarks (#267)
      * Tests - Refine test cases for microbenchmark (#268)
      * Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
      * Benchmarks: Fix Bug - Fix fio build issue (#272)
      * Docs - Unify metric and add doc for cublas and cudnn functions (#271)
      * Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
      * Bug - Fix bug of detecting if gpu_index is none (#275)
      * Bug - Fix bugs in data diagnosis (#273)
      * Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
      * Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
      * Docs - Upgrade version and release note (#277)
      Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
  14. 10 Dec, 2021 1 commit
    • Monitor: Integration - Integrate monitor into Superbench (#259) · 6e357fb9
      guoshzhao authored
      **Description**
      Integrate monitor into Superbench.
      
      **Major Revision**
      - Initialize, start and stop monitor in SB executor.
      - Parse the monitor data in SB runner and merge into benchmark results.
      - Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
      - Add monitor configs into config file.
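      The ReduceType behavior described above can be sketched as follows (illustrative helper, not the actual monitor code):

```python
# Collapse a time series of monitor samples into a single value,
# depending on the metric's ReduceType.
REDUCERS = {
    'MAX': max,                          # peak value, e.g. max temperature
    'MIN': min,                          # lowest value observed
    'LAST': lambda samples: samples[-1], # final reading of the run
}

def reduce_metric(samples, reduce_type):
    """Reduce a list of monitor samples per its ReduceType."""
    return REDUCERS[reduce_type](samples)
```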
  15. 02 Dec, 2021 1 commit
  16. 26 Sep, 2021 1 commit
    • Release - SuperBench v0.3.0 (#212) · dfbd70b1
      Yifan Xiong authored
      **Description**
      
      Cherry-pick bug fixes from v0.3.0 to main.
      
      **Major Revisions**
      * Docs - Upgrade version and release note (#209)
      * Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
      * Benchmarks: Update - Update benchmarks in configuration file (#208)
      * CI/CD - Update GitHub Action VM (#211)
      * Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
      * CI/CD - Fix bug in build image for push event (#205)
      * Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
      * Tool: Fix bug - Fix function naming issue in system info  (#200)
      * CI/CD - Push images in GitHub Action (#202)
      * Bug - Fix torch.distributed command for single node (#201)
      * CLI - Integrate system info for node (#199)
      * Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
      * CI/CD - Add ROCm image build in GitHub Actions (#194)
      * Bug: Fix bug - fix bug of hipBusBandwidth build (#193)
      * Benchmarks: Build Pipeline - Restore rocblas build logic (#197)
      * Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
      * Bug - Revise 'docker run' in sb deploy (#195)
      * Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190)
      Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
      Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
      Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
  17. 20 Aug, 2021 1 commit
    • Runner: Add Feature - Generate summarized output files. (#157) · 7595d794
      guoshzhao authored
      **Description**
      Generate the summarized output files from all nodes. For each metric, do the reduce operation according to its `reduce_op`.
      
      **Major Revision**
      - Generate the summarized json file per node:
      For microbenchmarks, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`.
      For model benchmarks, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`.
      `[]` means optional.
      ```
      {
        "kernel-launch/overhead_event:0": 0.00583,
        "kernel-launch/overhead_event:1": 0.00545,
        "kernel-launch/overhead_event:2": 0.00581,
        "kernel-launch/overhead_event:3": 0.00572,
        "kernel-launch/overhead_event:4": 0.00559,
        "kernel-launch/overhead_event:5": 0.00591,
        "kernel-launch/overhead_event:6": 0.00562,
        "kernel-launch/overhead_event:7": 0.00586,
        "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
        "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
        "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
        "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
        "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
        "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
        "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
        "pytorch-sharding-matmul/1/allgather": 10.088025093078613
      }
      ```
      - Generate the summarized jsonl file for all nodes, each line is the result from one node in json format.
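      The naming scheme above can be sketched as follows (the helper is illustrative, not the actual runner code; for model benchmarks the sub-benchmark name would be folded into the benchmark part):

```python
def metric_name(benchmark, metric, run_count=None, rank=None):
    """Build '{benchmark}/[{run_count}/]{metric}[:rank]' ('[]' = optional)."""
    parts = [benchmark]
    if run_count is not None:
        parts.append(str(run_count))
    parts.append(metric)
    name = '/'.join(parts)
    return f'{name}:{rank}' if rank is not None else name
```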
  18. 19 Aug, 2021 1 commit
  19. 02 Jul, 2021 1 commit
    • Runner - Fetch benchmarks results on all nodes (#116) · fb7d4a73
      Yifan Xiong authored
      Fetch benchmark results from all nodes; results are rsynced after each benchmark.
      The results directory structure on control node is as follows:
      
      ```
      outputs/
      └── datetime
          ├── nodes
          │   └── node-0
          │       ├── benchmarks
          │       │   └── benchmark-0
          │       │       └── rank-0
          │       │           └── results.json
          │       └── sb-exec.log
          ├── sb-run.log
          └── sb.config.yaml
      ```
  20. 01 Jul, 2021 1 commit
  21. 23 Jun, 2021 1 commit
    • Bug bash - Fix bugs in multi GPU benchmarks (#98) · c0c43b8f
      Yifan Xiong authored
      * Add `sb deploy` command content.
      * Fix inline if-expression syntax in playbook.
      * Fix quote escape issue in bash command.
      * Add custom env in config.
      * Update default config for multi GPU benchmarks.
      * Update MANIFEST.in to include jinja2 template.
      * Require jinja2 minimum version.
      * Fix occasional duplicate output in Ansible runner.
      * Fix mixed color from Ansible and Python colorlog.
      * Update according to comments.
      * Change superbench.env from list to dict in config file.
  22. 02 Jun, 2021 1 commit
  23. 28 May, 2021 1 commit
  24. 12 Apr, 2021 1 commit