1. 08 Jan, 2024 1 commit
    • Release - SuperBench v0.10.0 (#607) · 2c88db90
      Yifan Xiong authored
      
      
      **Description**
      
      Cherry-pick bug fixes from v0.10.0 to main.
      
      **Major Revisions**
      
      * Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
      * Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
      * Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
      * Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
      * Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
      * CI/CD - Add ndv5 topo file #597
      * Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
      * Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
      * Dockerfile - Bug fix for rocm docker build and deploy #598
      * Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
      * Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
      * Monitor - Upgrade pyrsmi to amdsmi python library. #601
      * Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
      * Dockerfile - Add rocm6.0 dockerfile #602
      * Bug Fix - Bug fix for latest megatron-lm benchmark #600
      * Docs - Upgrade version and release note #606
      Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
      Co-authored-by: Yang Wang <yangwang1@microsoft.com>
      Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
      Co-authored-by: guoshzhao <guzhao@microsoft.com>
  2. 10 Dec, 2023 1 commit
  3. 08 Dec, 2023 1 commit
  4. 07 Dec, 2023 1 commit
  5. 30 Jun, 2023 1 commit
  6. 21 Mar, 2023 1 commit
  7. 13 Feb, 2023 1 commit
  8. 04 Jan, 2023 1 commit
    • Runner - Generate host groups file in mpi mode (#458) · 8e748d56
      Yang Wang authored
      **Major Revision**
      
      - Add a pattern option that generates the mpi_pattern.txt file if a
      path is specified.
      - In mpi pattern, serial_index and parallel_index are added to each
      benchmark as environment variables.
      
      **Minor Revision**
      - Fix typo
  9. 03 Jan, 2023 1 commit
  10. 06 Sep, 2022 1 commit
    • Release - SuperBench v0.6.0 (#409) · 63e9b2d1
      Yifan Xiong authored
      
      
      **Description**
      
      Cherry-pick bug fixes from v0.6.0 to main.
      
      **Major Revisions**
      
      * Enable latency test in ib traffic validation distributed benchmark (#396)
      * Enhance parameter parsing to allow spaces in value (#397)
      * Update apt packages in dockerfile (#398)
      * Upgrade colorlog for NO_COLOR support (#404)
      * Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
      * Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
      * Enhance timeout cleanup to avoid possible hanging (#405)
      * Auto generate ibstat file by pssh (#402)
      * Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
      * Docs - Upgrade version and release note (#407)
      * Docs - Fix issues in document (#408)
      Co-authored-by: Yang Wang <yangwang1@microsoft.com>
      Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
  11. 22 Aug, 2022 1 commit
  12. 09 Aug, 2022 1 commit
  13. 01 Aug, 2022 1 commit
    • Analyzer - Add failure check feature in data diagnosis (#378) · ec16d425
      Yuting Jiang authored
      **Description**
      Add failure check feature in data diagnosis.
      
      **Major Revision**
      - Add a failure check rule op: if any metric_regex is not matched by any metric in the result, label the result as failedtest
      - Split the categories into performance issue and failedtest
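
The failure check described above can be sketched roughly as follows. This is a minimal illustration, not the analyzer's actual code: the function names `find_unmatched_rules` and `label_result`, the `"ok"` label, and the choice of whole-name matching are all assumptions made for the example. Only `metric_regex` and `failedtest` come from the commit message.

```python
import re


def find_unmatched_rules(metric_regexes, result_metrics):
    """Return every regex that matches none of the metric names in the result."""
    unmatched = []
    for pattern in metric_regexes:
        compiled = re.compile(pattern)
        # Assumption: a rule counts as matched only if the regex matches
        # a whole metric name (re.fullmatch), not just a substring.
        if not any(compiled.fullmatch(name) for name in result_metrics):
            unmatched.append(pattern)
    return unmatched


def label_result(metric_regexes, result_metrics):
    """Label a result 'failedtest' if any expected metric pattern is missing."""
    missing = find_unmatched_rules(metric_regexes, result_metrics)
    # "ok" is a hypothetical label used only in this sketch.
    return ("failedtest", missing) if missing else ("ok", [])
```

For instance, with the rules `nccl-bw/allreduce_\d+_bw` and `gpu-burn/.*` and a result that only contains `nccl-bw/allreduce_8_bw`, the second rule has no matching metric, so the result would be labeled `failedtest`.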
      
      
      **Minor Revision**
      - Replace DataFrame.append() with pd.concat(), since append() will be removed in a later version of pandas
  14. 26 Jul, 2022 1 commit
    • Support topo-aware IB performance validation (#373) · ef4d6574
      Jie Zhang authored
      
      
      * Support topo-aware IB performance validation
      
      Add a new pattern `topo-aware`, so the user can run IB performance
      tests based on the VMs' topology information. This way, the user can
      validate the IB performance across VM pairs at different distances
      as a quick test instead of a full pair-wise test.
      
      To run with the topo-aware pattern, the user needs to specify three
      required (and two optional) parameters in the YAML config file:
      --pattern	topo-aware
      --ibstat	path to ibstat output
      --ibnetdiscover	path to ibnetdiscover output
      --min_dist	minimum distance of VM pairs (optional, default 2)
      --max_dist	maximum distance of VM pairs (optional, default 6)
      
      The newly added topo_aware module then parses the topology
      information, builds a graph, and generates the VM pairs with
      the specified distance (# hops).
      
      The specified IB test then runs across these generated VM pairs.
      Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
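
As a rough illustration of the "build a graph, then generate VM pairs by hop distance" step, the sketch below runs a breadth-first search over an adjacency dictionary. It is not the topo_aware module itself: the graph layout, the function names `hop_distances` and `pairs_by_distance`, and the node names are hypothetical, and parsing of the ibstat and ibnetdiscover output is left out. Only the `min_dist`/`max_dist` defaults (2 and 6) are taken from the commit message.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first search: number of hops from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def pairs_by_distance(graph, hosts, min_dist=2, max_dist=6):
    """Group unordered host pairs by hop distance within [min_dist, max_dist]."""
    pairs = {}
    hosts = sorted(hosts)
    for i, src in enumerate(hosts):
        dist = hop_distances(graph, src)
        for dst in hosts[i + 1:]:
            d = dist.get(dst)  # None if dst is unreachable from src
            if d is not None and min_dist <= d <= max_dist:
                pairs.setdefault(d, []).append((src, dst))
    return pairs


# Hypothetical topology: two leaf switches joined by one spine switch.
example_graph = {
    "vm0": ["sw0"], "vm1": ["sw0"], "vm2": ["sw1"],
    "sw0": ["vm0", "vm1", "spine"],
    "sw1": ["vm2", "spine"],
    "spine": ["sw0", "sw1"],
}
example_pairs = pairs_by_distance(example_graph, ["vm0", "vm1", "vm2"])
```

In this toy topology, VMs under the same leaf switch are 2 hops apart and VMs under different leaf switches are 4 hops apart, so testing one pair per distance covers both kinds of path without running every pair.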
      
      * Add description about topology aware ib traffic tests
      Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
      
      * Add unit test to verify generated topology aware config file
      
      This commit adds a unit test to verify that the generated topology
      aware config file is correct. To do so, four new data files are added
      so that the test can invoke the gen_topo_aware_config function to
      generate a topology aware config file and compare it with the expected
      config file.
      Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
      
      * Fix lint issue on Azure pipeline
      Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
  15. 01 Jun, 2022 1 commit
    • Analyzer - Fix bugs in data diagnosis (#355) · 54da021b
      user4543 authored
      **Description**
      Fix bugs in data diagnosis.
      
      **Major Revision**
      - Add support for getting the baseline of a metric that uses custom benchmark naming with ':', such as 'nccl-bw:default/allreduce_8_bw:0'
      - Save raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True
      - Fix a bug where the wrong column index was used when applying formats (red color and percentile) in the Excel output
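
One way to handle metric names that contain ':' in the benchmark part is to strip only a trailing numeric suffix before looking up the baseline. The sketch below assumes that the final `:0` in 'nccl-bw:default/allreduce_8_bw:0' is a rank index and that baselines are keyed by the name without it. Both points are assumptions for illustration, and `split_metric` and `lookup_baseline` are hypothetical helpers, not functions from the analyzer.

```python
import re

# Assumption: only a trailing ":<digits>" is a rank suffix; any other ':'
# (such as the one in "nccl-bw:default") belongs to the benchmark name.
_RANK_SUFFIX = re.compile(r":(\d+)$")


def split_metric(full_name):
    """Split 'name:rank' into (name, rank); rank is None if there is no suffix."""
    match = _RANK_SUFFIX.search(full_name)
    if match:
        return full_name[:match.start()], int(match.group(1))
    return full_name, None


def lookup_baseline(baselines, full_name):
    """Look up a baseline by the metric name with the rank suffix removed."""
    key, _rank = split_metric(full_name)
    return baselines.get(key)
```

Splitting naively on the first ':' would cut 'nccl-bw:default/allreduce_8_bw:0' inside the benchmark name, which is the kind of mismatch anchoring the pattern to the end of the string avoids.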
  16. 10 Apr, 2022 1 commit
  17. 24 Mar, 2022 1 commit
  18. 16 Mar, 2022 1 commit
    • Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324) · ff51a3ce
      rafsalas19 authored
      **Description**
      Modifications adding GPU-Burn to SuperBench.
      - added third party submodule
      - modified Makefile to make gpu-burn binary
      - added/modified microbenchmarks to add gpu-burn python scripts
      - modified default and azure_ndv4 configs to add gpu-burn
  19. 15 Mar, 2022 1 commit
    • Analyzer - Add md and html output format for DataDiagnosis (#325) · b3c95f18
      user4543 authored
      **Description**
      Add md and html output format for DataDiagnosis.
      
      **Major Revision**
      - add md and html support in file_handler
      - add interface in DataDiagnosis for md and HTML output
      
      **Minor Revision**
      - move excel and json output interface into DataDiagnosis
  20. 09 Feb, 2022 1 commit
  21. 21 Jan, 2022 1 commit
  22. 30 Dec, 2021 1 commit
    • Release - SuperBench v0.4.0 (#278) · ff563b66
      Yifan Xiong authored
      
      
      __Description__
      
      Cherry-pick bug fixes from v0.4.0 to main.
      
      __Major Revisions__
      
      * Bug - Fix issues for Ansible and benchmarks (#267)
      * Tests - Refine test cases for microbenchmark (#268)
      * Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
      * Benchmarks: Fix Bug - Fix fio build issue (#272)
      * Docs - Unify metric and add doc for cublas and cudnn functions (#271)
      * Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
      * Bug - Fix bug of detecting if gpu_index is none (#275)
      * Bug - Fix bugs in data diagnosis (#273)
      * Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
      * Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
      * Docs - Upgrade version and release note (#277)
      Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
  23. 10 Dec, 2021 1 commit
    • Monitor: Integration - Integrate monitor into Superbench (#259) · 6e357fb9
      guoshzhao authored
      **Description**
      Integrate monitor into SuperBench.
      
      **Major Revision**
      - Initialize, start and stop monitor in SB executor.
      - Parse the monitor data in SB runner and merge into benchmark results.
      - Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
      - Add monitor configs into config file.
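
To make the role of a reduce type concrete, the sketch below collapses a series of monitor samples into a single value using MAX, MIN or LAST. The `ReduceType` enum here is a stand-in defined for this example, not the one in SuperBench, and `reduce_samples` is a hypothetical helper.

```python
from enum import Enum


class ReduceType(Enum):
    """Stand-in enum for this sketch; only the member names come from the commit."""
    MAX = "max"
    MIN = "min"
    LAST = "last"


# Map each reduce type to the function that collapses a list of samples.
_REDUCERS = {
    ReduceType.MAX: max,
    ReduceType.MIN: min,
    ReduceType.LAST: lambda samples: samples[-1],
}


def reduce_samples(samples, reduce_type):
    """Collapse the samples collected during a benchmark into one summary value."""
    if not samples:
        raise ValueError("no monitor samples to reduce")
    return _REDUCERS[reduce_type](samples)
```

A reading such as GPU temperature would plausibly be summarized with MAX, while a counter that only grows could use LAST. Which type each metric uses in practice is a configuration decision not specified here.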
  24. 12 Nov, 2021 1 commit