Release - SuperBench v0.4.0 (#278)

__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

Unverified Commit ff563b66 authored Dec 30, 2021 by Yifan Xiong Committed by GitHub Dec 30, 2021
20 changed files
--- a/README.md
+++ b/README.md
@@ -7,15 +7,15 @@
 [![Docker Pulls](https://img.shields.io/docker/pulls/superbench/superbench.svg)](https://hub.docker.com/r/superbench/superbench/tags)
 [![License](https://img.shields.io/github/license/microsoft/superbenchmark.svg)](LICENSE)
-| Azure Pipelines | Build Status |
+| Azure Pipelines          | Build Status                                                                                                                                                                                                            |
-| :---: | :---: |
+|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| cpu-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cpu-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=77&branchName=main) |
+| cpu-unit-test            | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cpu-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=77&branchName=main)            |
-| cuda-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cuda-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=80&branchName=main) |
+| cuda-unit-test           | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cuda-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=80&branchName=main)           |
 | ansible-integration-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/ansible-integration-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=82&branchName=main) |
 __SuperBench__ is a validation and profiling tool for AI infrastructure.
-📢 [v0.3.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.3.0) has been released!
+📢 [v0.4.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.4.0) has been released!
 ## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._

--- a/dockerfile/rocm4.0-pytorch1.7.0.dockerfile
+++ b/dockerfile/rocm4.0-pytorch1.7.0.dockerfile
@@ -63,26 +63,26 @@ RUN mkdir -p /root/.ssh && \
    echo -e "* soft nofile 1048576\n* hard nofile 1048576" >> /etc/security/limits.conf && \
    echo -e "root soft nofile 1048576\nroot hard nofile 1048576" >> /etc/security/limits.conf
+# Install OFED
+ENV OFED_VERSION=5.2-2.2.3.0
+RUN cd /tmp && \
+    wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
+    tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
+    PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
+    rm -rf MLNX_OFED_LINUX-${OFED_VERSION}*
 # Install OpenMPI
 ENV OPENMPI_VERSION=4.0.5
 RUN cd /tmp && \
    wget -q https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \
    tar xzf openmpi-${OPENMPI_VERSION}.tar.gz && \
    cd openmpi-${OPENMPI_VERSION} && \
-    ./configure --enable-orterun-prefix-by-default && \
+    ./configure --enable-orterun-prefix-by-default --with-ucx=/usr --enable-mca-no-build=btl-uct && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi-${OPENMPI_VERSION}*
-# Install OFED
-ENV OFED_VERSION=5.2-2.2.3.0
-RUN cd /tmp && \
-    wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
-    tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
-    PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
-    rm -rf MLNX_OFED_LINUX-${OFED_VERSION}*
 # Install HPC-X
 RUN cd /opt && \
    wget -q https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tbz && \

--- a/dockerfile/rocm4.2-pytorch1.7.0.dockerfile
+++ b/dockerfile/rocm4.2-pytorch1.7.0.dockerfile
@@ -69,7 +69,7 @@ RUN cd /tmp && \
    wget -q https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \
    tar xzf openmpi-${OPENMPI_VERSION}.tar.gz && \
    cd openmpi-${OPENMPI_VERSION} && \
-    ./configure --enable-orterun-prefix-by-default && \
+    ./configure --enable-orterun-prefix-by-default --with-ucx=/opt/ucx --enable-mca-no-build=btl-uct && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \

--- a/docs/getting-started/installation.mdx
+++ b/docs/getting-started/installation.mdx
@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
 :::note Note
 You should checkout corresponding tag to use release version, for example,
-`git clone -b v0.3.0 https://github.com/microsoft/superbenchmark`
+`git clone -b v0.4.0 https://github.com/microsoft/superbenchmark`
 :::
 ```bash

--- a/docs/getting-started/run-superbench.md
+++ b/docs/getting-started/run-superbench.md
@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
 :::note Note
 You should deploy corresponding Docker image to use release version, for example,
-`sb deploy -f local.ini -i superbench/superbench:v0.3.0-cuda11.1.1`
+`sb deploy -f local.ini -i superbench/superbench:v0.4.0-cuda11.1.1`
 You should note that version of git repo only determines version of sb CLI, and not the sb container. You should define the container version even if you specified a release version for the git clone.

--- a/docs/superbench-config.mdx
+++ b/docs/superbench-config.mdx
@@ -70,7 +70,7 @@ superbench:
 <TabItem value='example'>
 ```yaml
-version: v0.3
+version: v0.4
 superbench:
  enable: benchmark_1
  monitor:

--- a/docs/user-tutorial/benchmarks/micro-benchmarks.md
+++ b/docs/user-tutorial/benchmarks/micro-benchmarks.md
@@ -60,11 +60,40 @@ Large scale matmul operation using `torch.matmul` with one GPU.
 ### `cublas-function`
-TODO
+#### Introduction
+Measure the performance of most common Nvidia cuBLAS functions with parameters in models training including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.
+The supported functions for cuBLAS are as follows:
+ - cublasSgemm
+ - cublasSgemmStridedBatched
+ - cublasGemmStridedBatchedEx
+ - cublasGemmEx
+ - cublasCgemm3mStridedBatched
+ - cublasCgemm
+#### Metrics
+| Name                                                     | Unit      | Description                                                       |
+|----------------------------------------------------------|-----------|-------------------------------------------------------------------|
+| cublas-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cublas function with the parameters. |
 ### `cudnn-function`
-TODO
+#### Introduction
+Measure the performance of most common Nvidia cuDNN functions with parameters in models training including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.
+The supported functions for cuDNN are as follows:
+ - cudnnConvolutionBackwardFilter
+ - cudnnConvolutionBackwardData
+ - cudnnConvolutionForward
+#### Metrics
+| Name                                                    | Unit      | Description                                                      |
+|---------------------------------------------------------|-----------|------------------------------------------------------------------|
+| cudnn-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cudnn function with the parameters. |
 ### `tensorrt-inference`

--- a/docs/user-tutorial/container-images.mdx
+++ b/docs/user-tutorial/container-images.mdx
@@ -29,6 +29,7 @@ available tags are listed below for all stable versions.
 | Tag               | Description                        |
 | ----------------- | ---------------------------------- |
+| v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
 | v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
 | v0.2.1-cuda11.1.1 | SuperBench v0.2.1 with CUDA 11.1.1 |
 | v0.2.0-cuda11.1.1 | SuperBench v0.2.0 with CUDA 11.1.1 |
@@ -38,6 +39,8 @@ available tags are listed below for all stable versions.
 | Tag                         | Description                                    |
 | --------------------------- | ---------------------------------------------- |
+| v0.4.0-rocm4.2-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.2, PyTorch 1.7.0 |
+| v0.4.0-rocm4.0-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.0, PyTorch 1.7.0 |
 | v0.3.0-rocm4.2-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.2, PyTorch 1.7.0 |
 | v0.3.0-rocm4.0-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.0, PyTorch 1.7.0 |

--- a/docs/user-tutorial/data-diagnosis.md
+++ b/docs/user-tutorial/data-diagnosis.md
@@ -64,7 +64,7 @@ superbench:
 example:
 ```yaml
 # SuperBench rules
-version: v0.3
+version: v0.4
 superbench:
  rules:
    failure-rule:

--- a/setup.py
+++ b/setup.py
@@ -165,7 +165,7 @@ def run(self):
            'pytest>=6.2.2',
            'types-pyyaml',
            'vcrpy>=4.1.1',
-            'yapf>=0.30.0',
+            'yapf==0.31.0',
        ],
        'nvidia': ['py3nvml>=0.2.6'],
        'ort': [

--- a/superbench/__init__.py
+++ b/superbench/__init__.py
@@ -6,5 +6,5 @@
 Provide hardware and software benchmarks for AI systems.
 """
-__version__ = '0.3.0'
+__version__ = '0.4.0'
 __author__ = 'Microsoft'
--- a/superbench/analyzer/data_diagnosis.py
+++ b/superbench/analyzer/data_diagnosis.py
@@ -5,12 +5,13 @@
 import re
 from typing import Callable
+from pathlib import Path
 import pandas as pd
 from superbench.common.utils import logger
 from superbench.analyzer.diagnosis_rule_op import RuleOp, DiagnosisRuleType
-import superbench.analyzer.file_handler as file_handler
+from superbench.analyzer import file_handler
 class DataDiagnosis():
@@ -31,10 +32,15 @@ def _get_metrics_by_benchmarks(self, metrics_list):
        """
        benchmarks_metrics = {}
        for metric in metrics_list:
-            benchmark = metric.split('/')[0]
-            if benchmark not in benchmarks_metrics:
-                benchmarks_metrics[benchmark] = set()
-            benchmarks_metrics[benchmark].add(metric)
+            if '/' not in metric:
+                logger.warning(
+                    'DataDiagnosis: get_metrics_by_benchmarks - {} does not have benchmark_name'.format(metric)
+                )
+            else:
+                benchmark = metric.split('/')[0]
+                if benchmark not in benchmarks_metrics:
+                    benchmarks_metrics[benchmark] = set()
+                benchmarks_metrics[benchmark].add(metric)
        return benchmarks_metrics
    def _check_rules(self, rule, name):
@@ -133,6 +139,7 @@ def _get_criteria(self, rule_file, baseline_file):
                            if re.search(metric_regex, metric):
                                self._sb_rules[rule]['metrics'][metric] = self._get_baseline_of_metric(baseline, metric)
                                self._enable_metrics.append(metric)
+            self._enable_metrics.sort()
        except Exception as e:
            logger.error('DataDiagnosis: get criteria failed - {}'.format(str(e)))
            return False
@@ -171,8 +178,8 @@ def _run_diagnosis_rules_for_single_node(self, node):
                issue_label = True
        if issue_label:
            # Add category information
-            general_cat_str = ','.join(categories)
+            general_cat_str = ','.join(sorted(list(categories)))
-            details_cat_str = ','.join(details)
+            details_cat_str = ','.join(sorted((details)))
            details_row = [general_cat_str, details_cat_str]
            return details_row, summary_data_row
@@ -236,15 +243,15 @@ def run(self, raw_data_file, rule_file, baseline_file, output_dir, output_format
        try:
            self._raw_data_df = file_handler.read_raw_data(raw_data_file)
            self._metrics = self._get_metrics_by_benchmarks(list(self._raw_data_df.columns))
-            logger.info('DataDiagnosis: Begin to processe {} nodes'.format(len(self._raw_data_df)))
+            logger.info('DataDiagnosis: Begin to process {} nodes'.format(len(self._raw_data_df)))
            data_not_accept_df, label_df = self.run_diagnosis_rules(rule_file, baseline_file)
            logger.info('DataDiagnosis: Processed finished')
-            outpout_path = ''
+            output_path = ''
            if output_format == 'excel':
-                output_path = output_dir + '/diagnosis_summary.xlsx'
+                output_path = str(Path(output_dir) / 'diagnosis_summary.xlsx')
-                file_handler.output_excel(self._raw_data_df, data_not_accept_df, outpout_path, self._sb_rules)
+                file_handler.output_excel(self._raw_data_df, data_not_accept_df, output_path, self._sb_rules)
            elif output_format == 'json':
-                output_path = output_dir + '/diagnosis_summary.jsonl'
+                output_path = str(Path(output_dir) / 'diagnosis_summary.jsonl')
                file_handler.output_json_data_not_accept(data_not_accept_df, output_path)
            else:
                logger.error('DataDiagnosis: output failed - unsupported output format')
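The guard added to `_get_metrics_by_benchmarks` above can be read in isolation as follows. This is a minimal standalone sketch, not the SuperBench implementation: the function name `group_metrics_by_benchmark` and the sample column names are made up for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def group_metrics_by_benchmark(metrics):
    """Group metric names by the benchmark prefix before the first '/'."""
    grouped = {}
    for metric in metrics:
        if '/' not in metric:
            # No benchmark prefix: warn and skip instead of filing the
            # whole column name as if it were a benchmark.
            logger.warning('%s does not have benchmark_name', metric)
            continue
        benchmark = metric.split('/')[0]
        grouped.setdefault(benchmark, set()).add(metric)
    return grouped


# Hypothetical column names; 'hostname' has no benchmark prefix and is skipped.
columns = ['bench-a/metric_1', 'bench-a/metric_2', 'hostname', 'bench-b/metric_1']
print(group_metrics_by_benchmark(columns))
```

Without the guard, a column such as `hostname` would be treated as its own benchmark, because `'hostname'.split('/')[0]` returns the whole string.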

--- a/superbench/benchmarks/micro_benchmarks/_export_torch_to_onnx.py
+++ b/superbench/benchmarks/micro_benchmarks/_export_torch_to_onnx.py
@@ -129,10 +129,11 @@ def export_torchvision_model(self, model_name, batch_size=1):
        if not self.check_torchvision_model(model_name):
            return ''
        file_name = str(self._onnx_model_path / (model_name + '.onnx'))
-        input_shape = (batch_size, 3, 224, 224)
+        model = getattr(torchvision.models, model_name)(pretrained=False).eval().cuda()
+        dummy_input = torch.randn((batch_size, 3, 224, 224), device='cuda')
        torch.onnx.export(
-            getattr(torchvision.models, model_name)(pretrained=False).eval().cuda(),
+            model,
-            torch.randn(input_shape, device='cuda'),
+            dummy_input,
            file_name,
            opset_version=10,
            operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK,
@@ -147,6 +148,10 @@ def export_torchvision_model(self, model_name, batch_size=1):
                }
            },
        )
+        del model
+        del dummy_input
+        torch.cuda.empty_cache()
        return file_name
    def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
@@ -163,13 +168,13 @@ def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
        if not self.check_benchmark_model(model_name):
            return
        file_name = str(self._onnx_model_path / (model_name + '.onnx'))
-        input_shape, dtype = (batch_size, seq_length), torch.int64
+        model = self.benchmark_models[model_name]().eval().cuda()
+        dummy_input = torch.ones((batch_size, seq_length), dtype=torch.int64, device='cuda')
        if model_name == 'lstm':
-            input_shape += (self.lstm_input_size, )
-            dtype = None
+            dummy_input = torch.ones((batch_size, seq_length, self.lstm_input_size), device='cuda')
        torch.onnx.export(
-            self.benchmark_models[model_name]().eval().cuda(),
-            torch.ones(input_shape, dtype=dtype, device='cuda'),
+            model,
+            dummy_input,
            file_name,
            opset_version=10,
            do_constant_folding=True,
@@ -185,4 +190,8 @@ def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
                }
            },
        )
+        del model
+        del dummy_input
+        torch.cuda.empty_cache()
        return file_name
--- a/superbench/benchmarks/micro_benchmarks/cublas_function.py
+++ b/superbench/benchmarks/micro_benchmarks/cublas_function.py
@@ -291,8 +291,8 @@ def _process_raw_result(self, cmd_idx, raw_output):
                    raw_data = raw_data.split(',')
                    raw_data.pop()
                    raw_data = [float(item) for item in raw_data]
-                    self._result.add_result(metric, statistics.mean(raw_data))
+                    self._result.add_result(metric.lower() + '_time', statistics.mean(raw_data))
-                    self._result.add_raw_data(metric, raw_data)
+                    self._result.add_raw_data(metric.lower() + '_time', raw_data)
                if 'Error' in line:
                    error = True
        except BaseException as e:

--- a/superbench/benchmarks/micro_benchmarks/cudnn_function.py
+++ b/superbench/benchmarks/micro_benchmarks/cudnn_function.py
@@ -6,6 +6,7 @@
 import os
 import json
 import yaml
+import statistics
 from superbench.common.utils import logger
 from superbench.benchmarks import Platform, BenchmarkRegistry, ReturnCode
@@ -424,8 +425,8 @@ def _process_raw_result(self, cmd_idx, raw_output):
                    raw_data = raw_data.split(',')
                    raw_data.pop()
                    raw_data = [float(item) for item in raw_data]
-                    self._result.add_result(metric, sum(raw_data) / len(raw_data))
+                    self._result.add_result(metric.lower() + '_time', statistics.mean(raw_data) * 1000)
-                    self._result.add_raw_data(metric, raw_data)
+                    self._result.add_raw_data(metric.lower() + '_time', raw_data)
                if 'Error' in line:
                    error = True
        except BaseException as e:
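The metric handling changed in `cublas_function.py` and `cudnn_function.py` above follows one pattern: lower-case the name, append `_time`, and report the mean of the samples. The sketch below restates it outside the benchmark classes. The helper name `summarize_raw_times`, its `scale` parameter, and the sample metric name are assumptions for illustration. The diff itself does not state why the cuDNN path multiplies by 1000; a conversion from milliseconds to the documented microseconds is a plausible reading, not a confirmed one.

```python
import statistics


def summarize_raw_times(metric, raw_line, scale=1.0):
    """Return (metric_name, mean_time) for a comma-terminated sample line.

    raw_line is expected to look like '0.5,1.5,' (note the trailing comma).
    scale is a hypothetical unit-conversion factor, e.g. 1000 for ms -> us.
    """
    values = raw_line.split(',')
    values.pop()  # drop the empty string left by the trailing comma
    samples = [float(item) for item in values]
    return metric.lower() + '_time', statistics.mean(samples) * scale


# Hypothetical metric name, for illustration only.
name, mean_time = summarize_raw_times('name_cudnnConvolutionForward_example', '0.5,1.5,', scale=1000)
print(name, mean_time)
```

Using one helper for both benchmarks would keep the naming and averaging identical, which is the consistency the `_time` suffix in the docs table relies on.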

--- a/superbench/benchmarks/micro_benchmarks/ib_validation_performance.py
+++ b/superbench/benchmarks/micro_benchmarks/ib_validation_performance.py
@@ -249,7 +249,7 @@ def __prepare_general_ib_command_params(self):
            msg_size = '-s ' + str(self._args.msg_size)
        # Add GPUDirect for ib command
        gpu_enable = ''
-        if self._args.gpu_index:
+        if self._args.gpu_index is not None:
            gpu = GPU()
            if gpu.vendor == 'nvidia':
                gpu_enable = ' --use_cuda={gpu_index}'.format(gpu_index=str(self._args.gpu_index))
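The change from `if self._args.gpu_index:` to `if self._args.gpu_index is not None:` matters because GPU index `0` is falsy in Python. The sketch below isolates that behavior. Both function names are hypothetical, and the real method also checks the GPU vendor, which is omitted here.

```python
def cuda_flag_truthy(gpu_index):
    """Old behavior: a truthiness test, which also rejects index 0."""
    if gpu_index:
        return ' --use_cuda={}'.format(gpu_index)
    return ''


def cuda_flag_explicit(gpu_index):
    """New behavior: only None means that no GPU index was given."""
    if gpu_index is not None:
        return ' --use_cuda={}'.format(gpu_index)
    return ''


for index in (None, 0, 1):
    print(repr(index), repr(cuda_flag_truthy(index)), repr(cuda_flag_explicit(index)))
```

With the truthiness test, asking for GPU 0 silently produces no `--use_cuda` flag; the explicit `None` check treats GPU 0 like any other index.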

--- a/superbench/config/amd_mi100_hpe.yaml
+++ b/superbench/config/amd_mi100_hpe.yaml
@@ -3,7 +3,7 @@
 # Server:
 #   - Product: HPE Apollo 6500
-version: v0.3
+version: v0.4
 superbench:
  enable: null
  var:
@@ -99,9 +99,31 @@ superbench:
        copy_type:
          - sm
          - dma
-    ort-inference:
-      <<: *default_local_mode
+    ib-traffic:
+      enable: false
+      modes:
+        - name: mpi
+          proc_num: 1
+          mca:
+            btl: tcp,self
+            pml: ob1
+            btl_tcp_if_include: ens17f0
+    gpcnet-network-test:
      enable: false
+      modes:
+        - name: mpi
+          proc_num: 1
+          mca:
+            pml: ucx
+            btl: ^uct
+            btl_tcp_if_include: ens17f0
+    tcp-connectivity:
+      enable: false
+      modes:
+        - name: local
+          parallel: no
+      parameters:
+        port: 22
    ort-models:
      enable: false
      modes:
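The `mca` maps in the new `mpi` modes above correspond to Open MPI MCA parameters, which `mpirun` accepts as `-mca <key> <value>` pairs. How SuperBench itself turns the config into a command line is not shown in this diff, so the following is only a sketch under that assumption; `mca_flags` is a hypothetical helper.

```python
def mca_flags(mca):
    """Expand a mapping of MCA parameters into mpirun arguments."""
    args = []
    for key, value in mca.items():
        args.extend(['-mca', key, str(value)])
    return args


# Values copied from the gpcnet-network-test entry in amd_mi100_hpe.yaml.
mca = {'pml': 'ucx', 'btl': '^uct', 'btl_tcp_if_include': 'ens17f0'}
command = ['mpirun'] + mca_flags(mca)
print(' '.join(command))
# prints "mpirun -mca pml ucx -mca btl ^uct -mca btl_tcp_if_include ens17f0"
```

Building the command as a list rather than a single string avoids shell-quoting concerns if it is later passed to `subprocess.run`.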

--- a/superbench/config/amd_mi100_z53.yaml
+++ b/superbench/config/amd_mi100_z53.yaml
@@ -4,7 +4,7 @@
 #   - Product: G482-Z53
 #   - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
-version: v0.3
+version: v0.4
 superbench:
  enable: null
  var:

--- a/superbench/config/azure_ndmv4.yaml
+++ b/superbench/config/azure_ndmv4.yaml
@@ -3,9 +3,13 @@
 # Azure NDm A100 v4
 # reference: https://docs.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series
-version: v0.3
+version: v0.4
 superbench:
  enable: null
+  monitor:
+    enable: true
+    sample_duration: 1
+    sample_interval: 10
  var:
    default_local_mode: &default_local_mode
      enable: true
@@ -123,6 +127,52 @@ superbench:
      <<: *default_pytorch_mode
    computation-communication-overlap:
      <<: *default_pytorch_mode
+    ib-traffic:
+      enable: false
+      modes:
+        - name: mpi
+          proc_num: 1
+    gpcnet-network-test:
+      enable: false
+      modes:
+        - name: mpi
+          proc_num: 1
+          mca:
+            pml: ucx
+            btl: ^uct
+            btl_tcp_if_include: eth0
+    gpcnet-network-load-test:
+      enable: false
+      modes:
+        - name: mpi
+          proc_num: 1
+          mca:
+            pml: ucx
+            btl: ^uct
+            btl_tcp_if_include: eth0
+    tcp-connectivity:
+      enable: false
+      modes:
+        - name: local
+          parallel: no
+      parameters:
+        port: 22
+    ort-inference:
+      <<: *default_local_mode
+    tensorrt-inference:
+      <<: *default_local_mode
+      parameters:
+        pytorch_models:
+          - resnet50
+          - resnet101
+          - resnet152
+          - densenet169
+          - densenet201
+          - bert-base
+          - bert-large
+        seq_length: 224
+        batch_size: 32
+        precision: int8
    gpt_models:
      <<: *default_pytorch_mode
      models:

--- a/superbench/config/azure_ndv4.yaml
+++ b/superbench/config/azure_ndv4.yaml
 # SuperBench Config
-version: v0.3
+version: v0.4
 superbench:
  enable: null
  monitor:
-    enable: false
+    enable: true
    sample_duration: 1
    sample_interval: 10
  var:
@@ -109,9 +109,52 @@ superbench:
      <<: *default_pytorch_mode
    computation-communication-overlap:
      <<: *default_pytorch_mode
+    ib-traffic:
+      enable: false
+      modes:
+        - name: mpi
+          proc_num: 1
+    gpcnet-network-test:
+      enable: false
+      modes:
+        - name: mpi
+          proc_num: 1
+          mca:
+            pml: ucx
+            btl: ^uct
+            btl_tcp_if_include: eth0
+    gpcnet-network-load-test:
+      enable: false
+      modes:
+        - name: mpi
+          proc_num: 1
+          mca:
+            pml: ucx
+            btl: ^uct
+            btl_tcp_if_include: eth0
+    tcp-connectivity:
+      enable: false
+      modes:
+        - name: local
+          parallel: no
+      parameters:
+        port: 22
    ort-inference:
      <<: *default_local_mode
-      enable: false
+    tensorrt-inference:
+      <<: *default_local_mode
+      parameters:
+        pytorch_models:
+          - resnet50
+          - resnet101
+          - resnet152
+          - densenet169
+          - densenet201
+          - bert-base
+          - bert-large
+        seq_length: 224
+        batch_size: 32
+        precision: int8
    gpt_models:
      <<: *default_pytorch_mode
      models: