1. 05 Sep, 2022 1 commit
  2. 02 Sep, 2022 1 commit
      Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399) · db842892
      Yuting Jiang authored
      **Description**
      Make baseline check optional in data diagnosis and fix bugs.
      
      **Major Revision**
      - make baseline file optional in data diagnosis
      - fix output bugs in md and excel formats when 'function' is not in the rule
      - fix bug in multi_rules function where a missed/failed test could fail the whole process
      
      **Minor Revision**
      - revise docs related to data diagnosis
      - resolve the warning message in the baseline-not-found check; only raise an exception if the baseline is not found in the 'variance' function
      - move summary fields to the top of the json file
      - unify 'Index', 'machine' -> 'index' in the output file
  3. 01 Sep, 2022 1 commit
  4. 23 Aug, 2022 1 commit
      Analyzer - Add support to store values of metrics in data diagnosis (#392) · 733860d7
      Yuting Jiang authored
      **Description**
      Add support to store values of metrics in data diagnosis.
      
      Take the following rules as an example:
      ```
          nccl_store_rule:
            categories: NCCL_DIS
            store: True
            metrics:
              - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
              - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
          nccl_rule:
            function: multi_rules
            criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False'
            categories: NCCL_DIS
      ```
      **nccl_store_rule** stores the values of its metrics in a dict and saves them into `label["nccl_store_rule"]`; **nccl_rule** can then use those values through `label["nccl_store_rule"].values()` in its criteria.
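The interaction between the two rules can be sketched in plain Python. This is an illustration only, not the analyzer's implementation: `apply_store_rule` and `is_violated` are hypothetical helpers, the bandwidth values are made up, and evaluating the criteria string with `eval` is an assumption about how such a lambda string might be applied.

```python
# Criteria string copied from nccl_rule above.
CRITERIA = (
    'lambda label:True if min(label["nccl_store_rule"].values())'
    '/max(label["nccl_store_rule"].values())<0.95 else False'
)


def apply_store_rule(label, rule_name, metric_names, results):
    """Hypothetical helper: save the listed metric values as a dict in label[rule_name]."""
    label[rule_name] = {m: results[m] for m in metric_names if m in results}
    return label


def is_violated(label, criteria=CRITERIA):
    """Evaluate the criteria string as a Python lambda and call it with label."""
    return eval(criteria)(label)


METRICS = [f"nccl-bw:allreduce-run{i}/allreduce_1073741824_busbw" for i in range(5)]

# Made-up bus bandwidth values, for illustration only.
slow_node = dict(zip(METRICS, [230.0, 231.0, 229.0, 210.0, 232.0]))
healthy_node = dict(zip(METRICS, [230.0, 231.0, 229.0, 228.0, 232.0]))

slow_label = apply_store_rule({}, "nccl_store_rule", METRICS, slow_node)
healthy_label = apply_store_rule({}, "nccl_store_rule", METRICS, healthy_node)

print(is_violated(slow_label))     # min/max = 210/232, below 0.95
print(is_violated(healthy_label))  # min/max = 228/232, above 0.95
```

In this sketch the rule fires when the slowest of the five runs falls more than 5% below the fastest one, which is what the `min(...)/max(...)<0.95` criteria expresses.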
  5. 22 Aug, 2022 1 commit
  6. 09 Aug, 2022 1 commit
  7. 01 Aug, 2022 1 commit
      Analyzer - Add failure check feature in data diagnosis (#378) · ec16d425
      Yuting Jiang authored
      **Description**
      Add failure check feature in data diagnosis.
      
      **Major Revision**
      - Add a failure check rule op: if any metric_regex is not matched by any metric in the result, label the result as failedtest
      - Split performance issues and failedtest into separate categories
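The failure check described above can be sketched as follows. This is a minimal illustration, not the analyzer's code: `unmatched_patterns` is a hypothetical helper, `example-bench` is a made-up benchmark name, and matching with `re.search` is an assumption.

```python
import re


def unmatched_patterns(metric_regexes, result_metrics):
    """Return the patterns that no metric name in the result matches."""
    return [
        pattern for pattern in metric_regexes
        if not any(re.search(pattern, metric) for metric in result_metrics)
    ]


# Metric names present in one (hypothetical) result.
result_metrics = [
    "nccl-bw:allreduce-run0/allreduce_1073741824_busbw",
    "nccl-bw:allreduce-run1/allreduce_1073741824_busbw",
]

patterns = [
    r"nccl-bw:allreduce-run\d+/allreduce_1073741824_busbw",
    r"example-bench/.*",  # hypothetical benchmark that produced no metrics
]

missing = unmatched_patterns(patterns, result_metrics)
# Any unmatched pattern means an expected test produced no metrics.
category = "failedtest" if missing else None
print(missing, category)
```

Keeping the unmatched patterns, rather than only a boolean, makes it possible to report which expected metrics were absent.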
      
      
      **Minor Revision**
      - replace DataFrame.append() with pd.concat() since append() will be removed in a later version of pandas
  8. 01 Jun, 2022 1 commit
      Analyzer - Fix bugs in data diagnosis (#355) · 54da021b
      user4543 authored
      **Description**
      Fix bugs in data diagnosis.
      
      **Major Revision**
      - add support for getting the baseline of a metric that uses custom benchmark naming with ':', such as 'nccl-bw:default/allreduce_8_bw:0'
      - save raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True
      - fix bug of using the wrong column index when applying formatting (red color and percentile) in the excel output
  9. 10 Apr, 2022 1 commit
  10. 24 Mar, 2022 1 commit
  11. 15 Mar, 2022 1 commit
      Analyzer - Add md and html output format for DataDiagnosis (#325) · b3c95f18
      user4543 authored
      **Description**
      Add md and html output format for DataDiagnosis.
      
      **Major Revision**
      - add md and html support in file_handler
      - add interface in DataDiagnosis for md and HTML output
      
      **Minor Revision**
      - move excel and json output interface into DataDiagnosis
  12. 07 Mar, 2022 1 commit
  13. 20 Feb, 2022 1 commit
  14. 30 Dec, 2021 1 commit
      Release - SuperBench v0.4.0 (#278) · ff563b66
      Yifan Xiong authored
      
      
      __Description__
      
      Cherry-pick bug fixes from v0.4.0 to main.
      
      __Major Revisions__
      
      * Bug - Fix issues for Ansible and benchmarks (#267)
      * Tests - Refine test cases for microbenchmark (#268)
      * Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
      * Benchmarks: Fix Bug - Fix fio build issue (#272)
      * Docs - Unify metric and add doc for cublas and cudnn functions (#271)
      * Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
      * Bug - Fix bug of detecting if gpu_index is none (#275)
      * Bug - Fix bugs in data diagnosis (#273)
      * Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
      * Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
      * Docs - Upgrade version and release note (#277)
      Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
  15. 08 Dec, 2021 1 commit