1. 12 Apr, 2023 1 commit
  2. 07 Apr, 2023 1 commit
  3. 06 Apr, 2023 4 commits
  4. 03 Apr, 2023 1 commit
      Monitor - Fix the cgroup version checking logic. (#502) · 26373edb
      guoshzhao authored
      **Description**
      `grep cgroup /proc/filesystems` does not work on NDv4: its cgroup
      version is v1, but this command reports v2. Instead, check for file
      existence to determine the cgroup version.
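A minimal sketch of a file-existence check for the cgroup version. The commit message does not name the file it tests, so the marker used here (`cgroup.controllers` at the cgroup root, which exists on the v2 unified hierarchy) and the function name `detect_cgroup_version` are assumptions for illustration, not the code from this commit.

```python
import os


def detect_cgroup_version(cgroup_root="/sys/fs/cgroup"):
    """Return 2 if cgroup_root looks like a cgroup v2 mount, otherwise 1.

    Assumption: the v2 unified hierarchy exposes a `cgroup.controllers`
    file at its root, while a v1 layout does not.
    """
    marker = os.path.join(cgroup_root, "cgroup.controllers")
    return 2 if os.path.isfile(marker) else 1


if __name__ == "__main__":
    print("cgroup version:", detect_cgroup_version())
```

Checking the mounted hierarchy directly avoids relying on `/proc/filesystems`, which lists what the kernel supports rather than what is actually mounted.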
  5. 28 Mar, 2023 1 commit
  6. 25 Mar, 2023 1 commit
  7. 24 Mar, 2023 1 commit
  8. 22 Mar, 2023 2 commits
  9. 21 Mar, 2023 2 commits
  10. 20 Mar, 2023 2 commits
  11. 27 Feb, 2023 1 commit
      Benchmarks: Revision - Support flexible warmup and non-random data... · eba298f5
      Yuting Jiang authored
      Benchmarks: Revision - Support flexible warmup and non-random data initialization in cublas-benchmark  (#479)
      
      **Description**
      Revise cublas-benchmark to support flexible warmup and to fill input
      data with a fixed number in perf tests, which improves running efficiency.
      
      **Major Revision**
      - Remove num_in_steps for warmup to support more flexible warmup settings
      for users
      - Add support for generating input with a fixed number for perf tests
  12. 13 Feb, 2023 2 commits
  13. 28 Jan, 2023 1 commit
  14. 17 Jan, 2023 1 commit
  15. 04 Jan, 2023 3 commits
  16. 03 Jan, 2023 6 commits
  17. 30 Dec, 2022 2 commits
  18. 29 Dec, 2022 1 commit
  19. 14 Dec, 2022 1 commit
  20. 29 Nov, 2022 1 commit
      Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430) · e4eeda0a
      Yang Wang authored
      * add mpi-parallels mode
      
      * update according to comments
      
      * fix and update doc
      
      * update
      
      * merge into 'mpi' mode
      
      * update according to comments
      
      * fix testcases
      
      * fix ansible
      
      * regard pattern as field
      
      * update
      
      * fix flake8 version
      
      * add flake8 range
      
      * remove map-by from host config
      
      * update comments
  21. 01 Nov, 2022 1 commit
      CLI - Add non-zero return code for `sb [deploy,run]` (#425) · 1b86503d
      Yifan Xiong authored
      Add a non-zero return code for the `sb deploy` and `sb run` commands
      when there are Ansible failures in the control plane.
      The return code is set to the number of failures.
      
      For failures caused by benchmarks, the return code is still set per
      benchmark in the results JSON file.
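A rough sketch of the return-code behaviour described above, in which the exit status equals the number of failed control-plane tasks. The result structure (a list of dicts with a `failed` flag) and the function name `control_plane_return_code` are hypothetical and are not taken from the SuperBench source.

```python
def control_plane_return_code(task_results):
    """Return the number of failed tasks; 0 means every task succeeded.

    `task_results` is a hypothetical list of dicts, one per control-plane
    task, each carrying a boolean `failed` flag.
    """
    return sum(1 for result in task_results if result.get("failed", False))


if __name__ == "__main__":
    results = [
        {"host": "node0", "failed": False},
        {"host": "node1", "failed": True},
        {"host": "node2", "failed": True},
    ]
    # A real CLI would typically pass this value to sys.exit().
    print(control_plane_return_code(results))
```

A caller such as a CI script can then treat any non-zero status as a control-plane failure, while per-benchmark return codes are still read from the results JSON file.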
  22. 31 Oct, 2022 1 commit
  23. 18 Oct, 2022 1 commit
  24. 06 Sep, 2022 1 commit
      Release - SuperBench v0.6.0 (#409) · 63e9b2d1
      Yifan Xiong authored
      
      
      **Description**
      
      Cherry-pick bug fixes from v0.6.0 to main.
      
      **Major Revisions**
      
      * Enable latency test in ib traffic validation distributed benchmark (#396)
      * Enhance parameter parsing to allow spaces in value (#397)
      * Update apt packages in dockerfile (#398)
      * Upgrade colorlog for NO_COLOR support (#404)
      * Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
      * Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
      * Enhance timeout cleanup to avoid possible hanging (#405)
      * Auto generate ibstat file by pssh (#402)
      * Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
      * Docs - Upgrade version and release note (#407)
      * Docs - Fix issues in document (#408)
      Co-authored-by: Yang Wang <yangwang1@microsoft.com>
      Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
  25. 23 Aug, 2022 1 commit
      Analyzer - Add support to store values of metrics in data diagnosis (#392) · 733860d7
      Yuting Jiang authored
      **Description**
      Add support to store values of metrics in data diagnosis.
      
      Take the following rules as an example:
      ```
          nccl_store_rule:
            categories: NCCL_DIS
            store: True
            metrics:
              - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
          nccl_rule:
            function: multi_rules
            criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False'
            categories: NCCL_DIS
      ```
      **nccl_store_rule** stores the values of its metrics in a dict and saves them into `label["nccl_store_rule"]`; **nccl_rule** can then use those values in its criteria through `label["nccl_store_rule"].values()`.
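To illustrate how the criteria in `nccl_rule` reads the stored values, the sketch below evaluates the same lambda string against a hand-built `label` dict. The bandwidth numbers are made up for illustration, and using `eval` through a helper named `evaluate_criteria` is an assumption about how such a string could be applied, not a description of the analyzer's internals.

```python
# Criteria string copied from the nccl_rule example above.
CRITERIA = (
    'lambda label:True if min(label["nccl_store_rule"].values())'
    '/max(label["nccl_store_rule"].values())<0.95 else False'
)


def evaluate_criteria(criteria, label):
    """Turn the criteria string into a callable and apply it to `label`."""
    rule = eval(criteria)  # illustration only; do not eval untrusted input
    return rule(label)


def build_label(values):
    """Build the dict that nccl_store_rule would store, one entry per run."""
    metric = "nccl-bw:allreduce-run{}/allreduce_1073741824_busbw"
    return {"nccl_store_rule": {metric.format(i): v for i, v in enumerate(values)}}


if __name__ == "__main__":
    # Illustrative bus bandwidth values; run3 is noticeably slower.
    label = build_label([190.0, 191.5, 189.0, 170.0, 190.5])
    print(evaluate_criteria(CRITERIA, label))
```

With these sample values the ratio of the slowest to the fastest run is below 0.95, so the rule flags the node. When all five runs are within 5% of each other, the criteria returns `False`.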