Commit ff563b66 authored by Yifan Xiong, committed by GitHub

Release - SuperBench v0.4.0 (#278)



__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
parent 682ed06a
...@@ -7,15 +7,15 @@ ...@@ -7,15 +7,15 @@
[![Docker Pulls](https://img.shields.io/docker/pulls/superbench/superbench.svg)](https://hub.docker.com/r/superbench/superbench/tags) [![Docker Pulls](https://img.shields.io/docker/pulls/superbench/superbench.svg)](https://hub.docker.com/r/superbench/superbench/tags)
[![License](https://img.shields.io/github/license/microsoft/superbenchmark.svg)](LICENSE) [![License](https://img.shields.io/github/license/microsoft/superbenchmark.svg)](LICENSE)
| Azure Pipelines | Build Status | | Azure Pipelines | Build Status |
| :---: | :---: | |--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| cpu-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cpu-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=77&branchName=main) | | cpu-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cpu-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=77&branchName=main) |
| cuda-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cuda-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=80&branchName=main) | | cuda-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cuda-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=80&branchName=main) |
| ansible-integration-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/ansible-integration-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=82&branchName=main) | | ansible-integration-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/ansible-integration-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=82&branchName=main) |
__SuperBench__ is a validation and profiling tool for AI infrastructure. __SuperBench__ is a validation and profiling tool for AI infrastructure.
📢 [v0.3.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.3.0) has been released! 📢 [v0.4.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.4.0) has been released!
## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._ ## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._
......
...@@ -63,26 +63,26 @@ RUN mkdir -p /root/.ssh && \ ...@@ -63,26 +63,26 @@ RUN mkdir -p /root/.ssh && \
echo -e "* soft nofile 1048576\n* hard nofile 1048576" >> /etc/security/limits.conf && \ echo -e "* soft nofile 1048576\n* hard nofile 1048576" >> /etc/security/limits.conf && \
echo -e "root soft nofile 1048576\nroot hard nofile 1048576" >> /etc/security/limits.conf echo -e "root soft nofile 1048576\nroot hard nofile 1048576" >> /etc/security/limits.conf
# Install OFED
ENV OFED_VERSION=5.2-2.2.3.0
RUN cd /tmp && \
wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
rm -rf MLNX_OFED_LINUX-${OFED_VERSION}*
# Install OpenMPI # Install OpenMPI
ENV OPENMPI_VERSION=4.0.5 ENV OPENMPI_VERSION=4.0.5
RUN cd /tmp && \ RUN cd /tmp && \
wget -q https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \ wget -q https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \
tar xzf openmpi-${OPENMPI_VERSION}.tar.gz && \ tar xzf openmpi-${OPENMPI_VERSION}.tar.gz && \
cd openmpi-${OPENMPI_VERSION} && \ cd openmpi-${OPENMPI_VERSION} && \
./configure --enable-orterun-prefix-by-default && \ ./configure --enable-orterun-prefix-by-default --with-ucx=/usr --enable-mca-no-build=btl-uct && \
make -j $(nproc) all && \ make -j $(nproc) all && \
make install && \ make install && \
ldconfig && \ ldconfig && \
rm -rf /tmp/openmpi-${OPENMPI_VERSION}* rm -rf /tmp/openmpi-${OPENMPI_VERSION}*
# Install OFED
ENV OFED_VERSION=5.2-2.2.3.0
RUN cd /tmp && \
wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
rm -rf MLNX_OFED_LINUX-${OFED_VERSION}*
# Install HPC-X # Install HPC-X
RUN cd /opt && \ RUN cd /opt && \
wget -q https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tbz && \ wget -q https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tbz && \
......
...@@ -69,7 +69,7 @@ RUN cd /tmp && \ ...@@ -69,7 +69,7 @@ RUN cd /tmp && \
wget -q https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \ wget -q https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \
tar xzf openmpi-${OPENMPI_VERSION}.tar.gz && \ tar xzf openmpi-${OPENMPI_VERSION}.tar.gz && \
cd openmpi-${OPENMPI_VERSION} && \ cd openmpi-${OPENMPI_VERSION} && \
./configure --enable-orterun-prefix-by-default && \ ./configure --enable-orterun-prefix-by-default --with-ucx=/opt/ucx --enable-mca-no-build=btl-uct && \
make -j $(nproc) all && \ make -j $(nproc) all && \
make install && \ make install && \
ldconfig && \ ldconfig && \
......
...@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it. ...@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
:::note Note :::note Note
You should checkout corresponding tag to use release version, for example, You should checkout corresponding tag to use release version, for example,
`git clone -b v0.3.0 https://github.com/microsoft/superbenchmark` `git clone -b v0.4.0 https://github.com/microsoft/superbenchmark`
::: :::
```bash ```bash
......
...@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password] ...@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
:::note Note :::note Note
You should deploy corresponding Docker image to use release version, for example, You should deploy corresponding Docker image to use release version, for example,
`sb deploy -f local.ini -i superbench/superbench:v0.3.0-cuda11.1.1` `sb deploy -f local.ini -i superbench/superbench:v0.4.0-cuda11.1.1`
You should note that version of git repo only determines version of sb CLI, and not the sb container. You should define the container version even if you specified a release version for the git clone. You should note that version of git repo only determines version of sb CLI, and not the sb container. You should define the container version even if you specified a release version for the git clone.
......
...@@ -70,7 +70,7 @@ superbench: ...@@ -70,7 +70,7 @@ superbench:
<TabItem value='example'> <TabItem value='example'>
```yaml ```yaml
version: v0.3 version: v0.4
superbench: superbench:
enable: benchmark_1 enable: benchmark_1
monitor: monitor:
......
...@@ -60,11 +60,40 @@ Large scale matmul operation using `torch.matmul` with one GPU. ...@@ -60,11 +60,40 @@ Large scale matmul operation using `torch.matmul` with one GPU.
### `cublas-function` ### `cublas-function`
TODO #### Introduction
Measure the performance of most common Nvidia cuBLAS functions with parameters in models training including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.
The supported functions for cuBLAS are as follows:
- cublasSgemm
- cublasSgemmStridedBatched
- cublasGemmStridedBatchedEx
- cublasGemmEx
- cublasCgemm3mStridedBatched
- cublasCgemm
#### Metrics
| Name | Unit | Description |
|----------------------------------------------------------|-----------|-------------------------------------------------------------------|
| cublas-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cublas function with the parameters. |
### `cudnn-function` ### `cudnn-function`
TODO #### Introduction
Measure the performance of most common Nvidia cuDNN functions with parameters in models training including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.
The supported functions for cuDNN are as follows:
- cudnnConvolutionBackwardFilter
- cudnnConvolutionBackwardData
- cudnnConvolutionForward
#### Metrics
| Name | Unit | Description |
|---------------------------------------------------------|-----------|------------------------------------------------------------------|
| cudnn-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cudnn function with the parameters. |
### `tensorrt-inference` ### `tensorrt-inference`
......
...@@ -29,6 +29,7 @@ available tags are listed below for all stable versions. ...@@ -29,6 +29,7 @@ available tags are listed below for all stable versions.
| Tag | Description | | Tag | Description |
| ----------------- | ---------------------------------- | | ----------------- | ---------------------------------- |
| v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
| v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 | | v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
| v0.2.1-cuda11.1.1 | SuperBench v0.2.1 with CUDA 11.1.1 | | v0.2.1-cuda11.1.1 | SuperBench v0.2.1 with CUDA 11.1.1 |
| v0.2.0-cuda11.1.1 | SuperBench v0.2.0 with CUDA 11.1.1 | | v0.2.0-cuda11.1.1 | SuperBench v0.2.0 with CUDA 11.1.1 |
...@@ -38,6 +39,8 @@ available tags are listed below for all stable versions. ...@@ -38,6 +39,8 @@ available tags are listed below for all stable versions.
| Tag | Description | | Tag | Description |
| --------------------------- | ---------------------------------------------- | | --------------------------- | ---------------------------------------------- |
| v0.4.0-rocm4.2-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.4.0-rocm4.0-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.0, PyTorch 1.7.0 |
| v0.3.0-rocm4.2-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.2, PyTorch 1.7.0 | | v0.3.0-rocm4.2-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.3.0-rocm4.0-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.0, PyTorch 1.7.0 | | v0.3.0-rocm4.0-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.0, PyTorch 1.7.0 |
......
...@@ -64,7 +64,7 @@ superbench: ...@@ -64,7 +64,7 @@ superbench:
example: example:
```yaml ```yaml
# SuperBench rules # SuperBench rules
version: v0.3 version: v0.4
superbench: superbench:
rules: rules:
failure-rule: failure-rule:
......
...@@ -165,7 +165,7 @@ def run(self): ...@@ -165,7 +165,7 @@ def run(self):
'pytest>=6.2.2', 'pytest>=6.2.2',
'types-pyyaml', 'types-pyyaml',
'vcrpy>=4.1.1', 'vcrpy>=4.1.1',
'yapf>=0.30.0', 'yapf==0.31.0',
], ],
'nvidia': ['py3nvml>=0.2.6'], 'nvidia': ['py3nvml>=0.2.6'],
'ort': [ 'ort': [
......
...@@ -6,5 +6,5 @@ ...@@ -6,5 +6,5 @@
Provide hardware and software benchmarks for AI systems. Provide hardware and software benchmarks for AI systems.
""" """
__version__ = '0.3.0' __version__ = '0.4.0'
__author__ = 'Microsoft' __author__ = 'Microsoft'
...@@ -5,12 +5,13 @@ ...@@ -5,12 +5,13 @@
import re import re
from typing import Callable from typing import Callable
from pathlib import Path
import pandas as pd import pandas as pd
from superbench.common.utils import logger from superbench.common.utils import logger
from superbench.analyzer.diagnosis_rule_op import RuleOp, DiagnosisRuleType from superbench.analyzer.diagnosis_rule_op import RuleOp, DiagnosisRuleType
import superbench.analyzer.file_handler as file_handler from superbench.analyzer import file_handler
class DataDiagnosis(): class DataDiagnosis():
@@ -31,10 +32,15 @@ def _get_metrics_by_benchmarks(self, metrics_list):
         """
         benchmarks_metrics = {}
         for metric in metrics_list:
-            benchmark = metric.split('/')[0]
-            if benchmark not in benchmarks_metrics:
-                benchmarks_metrics[benchmark] = set()
-            benchmarks_metrics[benchmark].add(metric)
+            if '/' not in metric:
+                logger.warning(
+                    'DataDiagnosis: get_metrics_by_benchmarks - {} does not have benchmark_name'.format(metric)
+                )
+            else:
+                benchmark = metric.split('/')[0]
+                if benchmark not in benchmarks_metrics:
+                    benchmarks_metrics[benchmark] = set()
+                benchmarks_metrics[benchmark].add(metric)
         return benchmarks_metrics
 
     def _check_rules(self, rule, name):
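The guard added in this hunk can be shown in isolation. The following is a minimal standalone sketch, not the project's code: `group_metrics_by_benchmark` is a hypothetical free function, the standard `logging` module stands in for SuperBench's `logger`, and the metric names are made up for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def group_metrics_by_benchmark(metrics_list):
    """Group metric names by the benchmark prefix before the first '/'."""
    benchmarks_metrics = {}
    for metric in metrics_list:
        if '/' not in metric:
            # Columns without a benchmark prefix are reported and skipped.
            logger.warning('%s does not have benchmark_name', metric)
            continue
        benchmark = metric.split('/')[0]
        benchmarks_metrics.setdefault(benchmark, set()).add(metric)
    return benchmarks_metrics


# Hypothetical column names; 'hostname' has no benchmark prefix.
grouped = group_metrics_by_benchmark(
    ['kernel-launch/event_time', 'kernel-launch/wall_time', 'hostname']
)
```

Before the change, a column such as `hostname` would have been grouped under a benchmark named `hostname`; with the guard it is left out of the result.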
...@@ -133,6 +139,7 @@ def _get_criteria(self, rule_file, baseline_file): ...@@ -133,6 +139,7 @@ def _get_criteria(self, rule_file, baseline_file):
if re.search(metric_regex, metric): if re.search(metric_regex, metric):
self._sb_rules[rule]['metrics'][metric] = self._get_baseline_of_metric(baseline, metric) self._sb_rules[rule]['metrics'][metric] = self._get_baseline_of_metric(baseline, metric)
self._enable_metrics.append(metric) self._enable_metrics.append(metric)
self._enable_metrics.sort()
except Exception as e: except Exception as e:
logger.error('DataDiagnosis: get criteria failed - {}'.format(str(e))) logger.error('DataDiagnosis: get criteria failed - {}'.format(str(e)))
return False return False
...@@ -171,8 +178,8 @@ def _run_diagnosis_rules_for_single_node(self, node): ...@@ -171,8 +178,8 @@ def _run_diagnosis_rules_for_single_node(self, node):
issue_label = True issue_label = True
if issue_label: if issue_label:
# Add category information # Add category information
general_cat_str = ','.join(categories) general_cat_str = ','.join(sorted(list(categories)))
details_cat_str = ','.join(details) details_cat_str = ','.join(sorted((details)))
details_row = [general_cat_str, details_cat_str] details_row = [general_cat_str, details_cat_str]
return details_row, summary_data_row return details_row, summary_data_row
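The `sorted(...)` calls in this hunk matter if `categories` and `details` are unordered collections such as sets, which is an assumption here since their construction is not shown. A small sketch with made-up category names:

```python
# Hypothetical category names; a set has no guaranteed iteration order
# for strings across interpreter runs.
categories = {'NCCL', 'KernelLaunch', 'Mem'}

# Sorting before joining yields the same string on every run.
general_cat_str = ','.join(sorted(categories))
```

Joining the set directly could produce a different ordering from one run to the next, which makes diagnosis output harder to compare or test.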
@@ -236,15 +243,15 @@ def run(self, raw_data_file, rule_file, baseline_file, output_dir, output_format
         try:
             self._raw_data_df = file_handler.read_raw_data(raw_data_file)
             self._metrics = self._get_metrics_by_benchmarks(list(self._raw_data_df.columns))
-            logger.info('DataDiagnosis: Begin to processe {} nodes'.format(len(self._raw_data_df)))
+            logger.info('DataDiagnosis: Begin to process {} nodes'.format(len(self._raw_data_df)))
             data_not_accept_df, label_df = self.run_diagnosis_rules(rule_file, baseline_file)
             logger.info('DataDiagnosis: Processed finished')
-            outpout_path = ''
+            output_path = ''
             if output_format == 'excel':
-                output_path = output_dir + '/diagnosis_summary.xlsx'
-                file_handler.output_excel(self._raw_data_df, data_not_accept_df, outpout_path, self._sb_rules)
+                output_path = str(Path(output_dir) / 'diagnosis_summary.xlsx')
+                file_handler.output_excel(self._raw_data_df, data_not_accept_df, output_path, self._sb_rules)
             elif output_format == 'json':
-                output_path = output_dir + '/diagnosis_summary.jsonl'
+                output_path = str(Path(output_dir) / 'diagnosis_summary.jsonl')
                 file_handler.output_json_data_not_accept(data_not_accept_df, output_path)
             else:
                 logger.error('DataDiagnosis: output failed - unsupported output format')
......
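The path handling in the hunk above can be sketched on its own. This is an illustrative helper, not part of SuperBench: `summary_path` and its `None` return for unsupported formats are assumptions; only the two file names and the `Path(output_dir) / name` pattern come from the diff.

```python
from pathlib import Path


def summary_path(output_dir, output_format):
    """Return the diagnosis summary path for a format, or None if unsupported."""
    names = {
        'excel': 'diagnosis_summary.xlsx',
        'json': 'diagnosis_summary.jsonl',
    }
    if output_format not in names:
        return None
    # Path's '/' operator inserts exactly one separator, even when
    # output_dir already ends with one.
    return str(Path(output_dir) / names[output_format])


excel_path = summary_path('outputs/', 'excel')
```

Compared with `output_dir + '/diagnosis_summary.xlsx'`, this avoids doubled separators and uses the platform's separator.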
...@@ -129,10 +129,11 @@ def export_torchvision_model(self, model_name, batch_size=1): ...@@ -129,10 +129,11 @@ def export_torchvision_model(self, model_name, batch_size=1):
if not self.check_torchvision_model(model_name): if not self.check_torchvision_model(model_name):
return '' return ''
file_name = str(self._onnx_model_path / (model_name + '.onnx')) file_name = str(self._onnx_model_path / (model_name + '.onnx'))
input_shape = (batch_size, 3, 224, 224) model = getattr(torchvision.models, model_name)(pretrained=False).eval().cuda()
dummy_input = torch.randn((batch_size, 3, 224, 224), device='cuda')
torch.onnx.export( torch.onnx.export(
getattr(torchvision.models, model_name)(pretrained=False).eval().cuda(), model,
torch.randn(input_shape, device='cuda'), dummy_input,
file_name, file_name,
opset_version=10, opset_version=10,
operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK,
...@@ -147,6 +148,10 @@ def export_torchvision_model(self, model_name, batch_size=1): ...@@ -147,6 +148,10 @@ def export_torchvision_model(self, model_name, batch_size=1):
} }
}, },
) )
del model
del dummy_input
torch.cuda.empty_cache()
return file_name return file_name
def export_benchmark_model(self, model_name, batch_size=1, seq_length=512): def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
...@@ -163,13 +168,13 @@ def export_benchmark_model(self, model_name, batch_size=1, seq_length=512): ...@@ -163,13 +168,13 @@ def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
if not self.check_benchmark_model(model_name): if not self.check_benchmark_model(model_name):
return return
file_name = str(self._onnx_model_path / (model_name + '.onnx')) file_name = str(self._onnx_model_path / (model_name + '.onnx'))
input_shape, dtype = (batch_size, seq_length), torch.int64 model = self.benchmark_models[model_name]().eval().cuda()
dummy_input = torch.ones((batch_size, seq_length), dtype=torch.int64, device='cuda')
if model_name == 'lstm': if model_name == 'lstm':
input_shape += (self.lstm_input_size, ) dummy_input = torch.ones((batch_size, seq_length, self.lstm_input_size), device='cuda')
dtype = None
torch.onnx.export( torch.onnx.export(
self.benchmark_models[model_name]().eval().cuda(), model,
torch.ones(input_shape, dtype=dtype, device='cuda'), dummy_input,
file_name, file_name,
opset_version=10, opset_version=10,
do_constant_folding=True, do_constant_folding=True,
...@@ -185,4 +190,8 @@ def export_benchmark_model(self, model_name, batch_size=1, seq_length=512): ...@@ -185,4 +190,8 @@ def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
} }
}, },
) )
del model
del dummy_input
torch.cuda.empty_cache()
return file_name return file_name
...@@ -291,8 +291,8 @@ def _process_raw_result(self, cmd_idx, raw_output): ...@@ -291,8 +291,8 @@ def _process_raw_result(self, cmd_idx, raw_output):
raw_data = raw_data.split(',') raw_data = raw_data.split(',')
raw_data.pop() raw_data.pop()
raw_data = [float(item) for item in raw_data] raw_data = [float(item) for item in raw_data]
self._result.add_result(metric, statistics.mean(raw_data)) self._result.add_result(metric.lower() + '_time', statistics.mean(raw_data))
self._result.add_raw_data(metric, raw_data) self._result.add_raw_data(metric.lower() + '_time', raw_data)
if 'Error' in line: if 'Error' in line:
error = True error = True
except BaseException as e: except BaseException as e:
......
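The result handling in this hunk reduces a comma-separated list of timings to a mean and stores it under a lowercased metric name with a `_time` suffix. Below is a minimal sketch of that step, assuming the raw string ends with a trailing comma (which the `pop()` call suggests); `summarize_samples` and the sample metric string are hypothetical.

```python
import statistics


def summarize_samples(metric, raw_data):
    """Return (metric_name, mean) for a comma-terminated string of timings."""
    samples = raw_data.split(',')
    samples.pop()  # Drop the empty field after the assumed trailing comma.
    values = [float(item) for item in samples]
    return metric.lower() + '_time', statistics.mean(values)


# Hypothetical metric string and timings.
name, mean_value = summarize_samples('name_cublasSgemm_m_512', '1.0,2.0,3.0,')
```

Lowercasing the name and appending `_time` matches the metric naming pattern listed in the documentation tables of this commit.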
...@@ -6,6 +6,7 @@ ...@@ -6,6 +6,7 @@
import os import os
import json import json
import yaml import yaml
import statistics
from superbench.common.utils import logger from superbench.common.utils import logger
from superbench.benchmarks import Platform, BenchmarkRegistry, ReturnCode from superbench.benchmarks import Platform, BenchmarkRegistry, ReturnCode
...@@ -424,8 +425,8 @@ def _process_raw_result(self, cmd_idx, raw_output): ...@@ -424,8 +425,8 @@ def _process_raw_result(self, cmd_idx, raw_output):
raw_data = raw_data.split(',') raw_data = raw_data.split(',')
raw_data.pop() raw_data.pop()
raw_data = [float(item) for item in raw_data] raw_data = [float(item) for item in raw_data]
self._result.add_result(metric, sum(raw_data) / len(raw_data)) self._result.add_result(metric.lower() + '_time', statistics.mean(raw_data) * 1000)
self._result.add_raw_data(metric, raw_data) self._result.add_raw_data(metric.lower() + '_time', raw_data)
if 'Error' in line: if 'Error' in line:
error = True error = True
except BaseException as e: except BaseException as e:
......
@@ -249,7 +249,7 @@ def __prepare_general_ib_command_params(self):
         msg_size = '-s ' + str(self._args.msg_size)
         # Add GPUDirect for ib command
         gpu_enable = ''
-        if self._args.gpu_index:
+        if self._args.gpu_index is not None:
             gpu = GPU()
             if gpu.vendor == 'nvidia':
                 gpu_enable = ' --use_cuda={gpu_index}'.format(gpu_index=str(self._args.gpu_index))
......
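The `is not None` fix addresses a common Python pitfall: GPU index `0` is a valid value but is falsy, so a plain truthiness check skips it. A short sketch, where `build_gpu_flag` is a hypothetical helper that covers only the `--use_cuda` case shown in the diff:

```python
def build_gpu_flag(gpu_index):
    """Return the GPUDirect flag for a GPU index, or '' when none is given."""
    if gpu_index is not None:
        return ' --use_cuda={gpu_index}'.format(gpu_index=str(gpu_index))
    return ''


# A truthiness check would treat index 0 the same as "not set".
zero_is_falsy = not bool(0)
flag_for_gpu0 = build_gpu_flag(0)
flag_for_unset = build_gpu_flag(None)
```

With the old `if self._args.gpu_index:` check, a run pinned to GPU 0 would have been launched without the GPUDirect flag.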
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
# Server: # Server:
# - Product: HPE Apollo 6500 # - Product: HPE Apollo 6500
version: v0.3 version: v0.4
superbench: superbench:
enable: null enable: null
var: var:
...@@ -99,9 +99,31 @@ superbench: ...@@ -99,9 +99,31 @@ superbench:
copy_type: copy_type:
- sm - sm
- dma - dma
ort-inference: ib-traffic:
<<: *default_local_mode enable: false
modes:
- name: mpi
proc_num: 1
mca:
btl: tcp,self
pml: ob1
btl_tcp_if_include: ens17f0
gpcnet-network-test:
enable: false enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: ens17f0
tcp-connectivity:
enable: false
modes:
- name: local
parallel: no
parameters:
port: 22
ort-models: ort-models:
enable: false enable: false
modes: modes:
......
...@@ -4,7 +4,7 @@ ...@@ -4,7 +4,7 @@
# - Product: G482-Z53 # - Product: G482-Z53
# - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html # - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
version: v0.3 version: v0.4
superbench: superbench:
enable: null enable: null
var: var:
......
...@@ -3,9 +3,13 @@ ...@@ -3,9 +3,13 @@
# Azure NDm A100 v4 # Azure NDm A100 v4
# reference: https://docs.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series # reference: https://docs.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series
version: v0.3 version: v0.4
superbench: superbench:
enable: null enable: null
monitor:
enable: true
sample_duration: 1
sample_interval: 10
var: var:
default_local_mode: &default_local_mode default_local_mode: &default_local_mode
enable: true enable: true
...@@ -123,6 +127,52 @@ superbench: ...@@ -123,6 +127,52 @@ superbench:
<<: *default_pytorch_mode <<: *default_pytorch_mode
computation-communication-overlap: computation-communication-overlap:
<<: *default_pytorch_mode <<: *default_pytorch_mode
ib-traffic:
enable: false
modes:
- name: mpi
proc_num: 1
gpcnet-network-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: eth0
gpcnet-network-load-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: eth0
tcp-connectivity:
enable: false
modes:
- name: local
parallel: no
parameters:
port: 22
ort-inference:
<<: *default_local_mode
tensorrt-inference:
<<: *default_local_mode
parameters:
pytorch_models:
- resnet50
- resnet101
- resnet152
- densenet169
- densenet201
- bert-base
- bert-large
seq_length: 224
batch_size: 32
precision: int8
gpt_models: gpt_models:
<<: *default_pytorch_mode <<: *default_pytorch_mode
models: models:
......
# SuperBench Config # SuperBench Config
version: v0.3 version: v0.4
superbench: superbench:
enable: null enable: null
monitor: monitor:
enable: false enable: true
sample_duration: 1 sample_duration: 1
sample_interval: 10 sample_interval: 10
var: var:
...@@ -109,9 +109,52 @@ superbench: ...@@ -109,9 +109,52 @@ superbench:
<<: *default_pytorch_mode <<: *default_pytorch_mode
computation-communication-overlap: computation-communication-overlap:
<<: *default_pytorch_mode <<: *default_pytorch_mode
ib-traffic:
enable: false
modes:
- name: mpi
proc_num: 1
gpcnet-network-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: eth0
gpcnet-network-load-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: eth0
tcp-connectivity:
enable: false
modes:
- name: local
parallel: no
parameters:
port: 22
ort-inference: ort-inference:
<<: *default_local_mode <<: *default_local_mode
enable: false tensorrt-inference:
<<: *default_local_mode
parameters:
pytorch_models:
- resnet50
- resnet101
- resnet152
- densenet169
- densenet201
- bert-base
- bert-large
seq_length: 224
batch_size: 32
precision: int8
gpt_models: gpt_models:
<<: *default_pytorch_mode <<: *default_pytorch_mode
models: models:
......