Commit ff563b66 authored by Yifan Xiong, committed by GitHub

Release - SuperBench v0.4.0 (#278)



__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
parent 682ed06a
......@@ -7,15 +7,15 @@
[![Docker Pulls](https://img.shields.io/docker/pulls/superbench/superbench.svg)](https://hub.docker.com/r/superbench/superbench/tags)
[![License](https://img.shields.io/github/license/microsoft/superbenchmark.svg)](LICENSE)
| Azure Pipelines | Build Status |
| :---: | :---: |
| cpu-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cpu-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=77&branchName=main) |
| cuda-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cuda-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=80&branchName=main) |
| Azure Pipelines | Build Status |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| cpu-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cpu-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=77&branchName=main) |
| cuda-unit-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/cuda-unit-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=80&branchName=main) |
| ansible-integration-test | [![Build Status](https://dev.azure.com/msrasrg/SuperBenchmark/_apis/build/status/ansible-integration-test?branchName=main)](https://dev.azure.com/msrasrg/SuperBenchmark/_build/latest?definitionId=82&branchName=main) |
__SuperBench__ is a validation and profiling tool for AI infrastructure.
📢 [v0.3.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.3.0) has been released!
📢 [v0.4.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.4.0) has been released!
## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._
......
......@@ -63,26 +63,26 @@ RUN mkdir -p /root/.ssh && \
echo -e "* soft nofile 1048576\n* hard nofile 1048576" >> /etc/security/limits.conf && \
echo -e "root soft nofile 1048576\nroot hard nofile 1048576" >> /etc/security/limits.conf
# Install OFED
ENV OFED_VERSION=5.2-2.2.3.0
RUN cd /tmp && \
wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
rm -rf MLNX_OFED_LINUX-${OFED_VERSION}*
# Install OpenMPI
ENV OPENMPI_VERSION=4.0.5
RUN cd /tmp && \
wget -q https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \
tar xzf openmpi-${OPENMPI_VERSION}.tar.gz && \
cd openmpi-${OPENMPI_VERSION} && \
./configure --enable-orterun-prefix-by-default && \
./configure --enable-orterun-prefix-by-default --with-ucx=/usr --enable-mca-no-build=btl-uct && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi-${OPENMPI_VERSION}*
# Install OFED
ENV OFED_VERSION=5.2-2.2.3.0
RUN cd /tmp && \
wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tgz && \
PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
rm -rf MLNX_OFED_LINUX-${OFED_VERSION}*
# Install HPC-X
RUN cd /opt && \
wget -q https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu18.04-x86_64.tbz && \
......
......@@ -69,7 +69,7 @@ RUN cd /tmp && \
wget -q https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \
tar xzf openmpi-${OPENMPI_VERSION}.tar.gz && \
cd openmpi-${OPENMPI_VERSION} && \
./configure --enable-orterun-prefix-by-default && \
./configure --enable-orterun-prefix-by-default --with-ucx=/opt/ucx --enable-mca-no-build=btl-uct && \
make -j $(nproc) all && \
make install && \
ldconfig && \
......
......@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
:::note Note
You should checkout corresponding tag to use release version, for example,
`git clone -b v0.3.0 https://github.com/microsoft/superbenchmark`
`git clone -b v0.4.0 https://github.com/microsoft/superbenchmark`
:::
```bash
......
......@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
:::note Note
You should deploy corresponding Docker image to use release version, for example,
`sb deploy -f local.ini -i superbench/superbench:v0.3.0-cuda11.1.1`
`sb deploy -f local.ini -i superbench/superbench:v0.4.0-cuda11.1.1`
You should note that version of git repo only determines version of sb CLI, and not the sb container. You should define the container version even if you specified a release version for the git clone.
......
......@@ -70,7 +70,7 @@ superbench:
<TabItem value='example'>
```yaml
version: v0.3
version: v0.4
superbench:
enable: benchmark_1
monitor:
......
......@@ -60,11 +60,40 @@ Large scale matmul operation using `torch.matmul` with one GPU.
### `cublas-function`
TODO
#### Introduction
Measure the performance of most common Nvidia cuBLAS functions with parameters in models training including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.
The supported functions for cuBLAS are as follows:
- cublasSgemm
- cublasSgemmStridedBatched
- cublasGemmStridedBatchedEx
- cublasGemmEx
- cublasCgemm3mStridedBatched
- cublasCgemm
#### Metrics
| Name | Unit | Description |
|----------------------------------------------------------|-----------|-------------------------------------------------------------------|
| cublas-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cublas function with the parameters. |
### `cudnn-function`
TODO
#### Introduction
Measure the performance of most common Nvidia cuDNN functions with parameters in models training including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.
The supported functions for cuDNN are as follows:
- cudnnConvolutionBackwardFilter
- cudnnConvolutionBackwardData
- cudnnConvolutionForward
#### Metrics
| Name | Unit | Description |
|---------------------------------------------------------|-----------|------------------------------------------------------------------|
| cudnn-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cudnn function with the parameters. |
### `tensorrt-inference`
......
......@@ -29,6 +29,7 @@ available tags are listed below for all stable versions.
| Tag | Description |
| ----------------- | ---------------------------------- |
| v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
| v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
| v0.2.1-cuda11.1.1 | SuperBench v0.2.1 with CUDA 11.1.1 |
| v0.2.0-cuda11.1.1 | SuperBench v0.2.0 with CUDA 11.1.1 |
......@@ -38,6 +39,8 @@ available tags are listed below for all stable versions.
| Tag | Description |
| --------------------------- | ---------------------------------------------- |
| v0.4.0-rocm4.2-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.4.0-rocm4.0-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.0, PyTorch 1.7.0 |
| v0.3.0-rocm4.2-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.3.0-rocm4.0-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.0, PyTorch 1.7.0 |
......
......@@ -64,7 +64,7 @@ superbench:
example:
```yaml
# SuperBench rules
version: v0.3
version: v0.4
superbench:
rules:
failure-rule:
......
......@@ -165,7 +165,7 @@ def run(self):
'pytest>=6.2.2',
'types-pyyaml',
'vcrpy>=4.1.1',
'yapf>=0.30.0',
'yapf==0.31.0',
],
'nvidia': ['py3nvml>=0.2.6'],
'ort': [
......
......@@ -6,5 +6,5 @@
Provide hardware and software benchmarks for AI systems.
"""
__version__ = '0.3.0'
__version__ = '0.4.0'
__author__ = 'Microsoft'
......@@ -5,12 +5,13 @@
import re
from typing import Callable
from pathlib import Path
import pandas as pd
from superbench.common.utils import logger
from superbench.analyzer.diagnosis_rule_op import RuleOp, DiagnosisRuleType
import superbench.analyzer.file_handler as file_handler
from superbench.analyzer import file_handler
class DataDiagnosis():
......@@ -31,10 +32,15 @@ def _get_metrics_by_benchmarks(self, metrics_list):
"""
benchmarks_metrics = {}
for metric in metrics_list:
benchmark = metric.split('/')[0]
if benchmark not in benchmarks_metrics:
benchmarks_metrics[benchmark] = set()
benchmarks_metrics[benchmark].add(metric)
if '/' not in metric:
logger.warning(
'DataDiagnosis: get_metrics_by_benchmarks - {} does not have benchmark_name'.format(metric)
)
else:
benchmark = metric.split('/')[0]
if benchmark not in benchmarks_metrics:
benchmarks_metrics[benchmark] = set()
benchmarks_metrics[benchmark].add(metric)
return benchmarks_metrics
def _check_rules(self, rule, name):
......@@ -133,6 +139,7 @@ def _get_criteria(self, rule_file, baseline_file):
if re.search(metric_regex, metric):
self._sb_rules[rule]['metrics'][metric] = self._get_baseline_of_metric(baseline, metric)
self._enable_metrics.append(metric)
self._enable_metrics.sort()
except Exception as e:
logger.error('DataDiagnosis: get criteria failed - {}'.format(str(e)))
return False
......@@ -171,8 +178,8 @@ def _run_diagnosis_rules_for_single_node(self, node):
issue_label = True
if issue_label:
# Add category information
general_cat_str = ','.join(categories)
details_cat_str = ','.join(details)
general_cat_str = ','.join(sorted(list(categories)))
details_cat_str = ','.join(sorted((details)))
details_row = [general_cat_str, details_cat_str]
return details_row, summary_data_row
......@@ -236,15 +243,15 @@ def run(self, raw_data_file, rule_file, baseline_file, output_dir, output_format
try:
self._raw_data_df = file_handler.read_raw_data(raw_data_file)
self._metrics = self._get_metrics_by_benchmarks(list(self._raw_data_df.columns))
logger.info('DataDiagnosis: Begin to processe {} nodes'.format(len(self._raw_data_df)))
logger.info('DataDiagnosis: Begin to process {} nodes'.format(len(self._raw_data_df)))
data_not_accept_df, label_df = self.run_diagnosis_rules(rule_file, baseline_file)
logger.info('DataDiagnosis: Processed finished')
outpout_path = ''
output_path = ''
if output_format == 'excel':
output_path = output_dir + '/diagnosis_summary.xlsx'
file_handler.output_excel(self._raw_data_df, data_not_accept_df, outpout_path, self._sb_rules)
output_path = str(Path(output_dir) / 'diagnosis_summary.xlsx')
file_handler.output_excel(self._raw_data_df, data_not_accept_df, output_path, self._sb_rules)
elif output_format == 'json':
output_path = output_dir + '/diagnosis_summary.jsonl'
output_path = str(Path(output_dir) / 'diagnosis_summary.jsonl')
file_handler.output_json_data_not_accept(data_not_accept_df, output_path)
else:
logger.error('DataDiagnosis: output failed - unsupported output format')
......
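The `_get_metrics_by_benchmarks` and category-joining changes above can be illustrated with a minimal standalone sketch. It assumes metric names follow a `benchmark/metric` pattern. The function names and sample metric names here are hypothetical and not part of SuperBench. Metrics without a `/` are logged and skipped, and categories are sorted before joining so the output string does not depend on set iteration order.

```python
import logging
from typing import Dict, Iterable, Set

logger = logging.getLogger(__name__)


def group_metrics_by_benchmark(metrics: Iterable[str]) -> Dict[str, Set[str]]:
    """Group metric names by the benchmark prefix before the first '/'."""
    grouped: Dict[str, Set[str]] = {}
    for metric in metrics:
        if '/' not in metric:
            # Mirrors the new guard: warn and skip instead of creating a bogus group.
            logger.warning('%s does not have a benchmark name', metric)
            continue
        benchmark = metric.split('/')[0]
        grouped.setdefault(benchmark, set()).add(metric)
    return grouped


def join_sorted(items: Iterable[str]) -> str:
    """Join items in sorted order so the result is deterministic for sets."""
    return ','.join(sorted(items))


# Hypothetical metric names, for illustration only.
columns = ['kernel-launch/event_time', 'kernel-launch/wall_time', 'mem-bw/h2d_bw', 'node']
grouped = group_metrics_by_benchmark(columns)
summary = join_sorted({'Mem', 'KernelLaunch'})
```

With this input, `node` is skipped with a warning rather than being treated as its own benchmark.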
......@@ -129,10 +129,11 @@ def export_torchvision_model(self, model_name, batch_size=1):
if not self.check_torchvision_model(model_name):
return ''
file_name = str(self._onnx_model_path / (model_name + '.onnx'))
input_shape = (batch_size, 3, 224, 224)
model = getattr(torchvision.models, model_name)(pretrained=False).eval().cuda()
dummy_input = torch.randn((batch_size, 3, 224, 224), device='cuda')
torch.onnx.export(
getattr(torchvision.models, model_name)(pretrained=False).eval().cuda(),
torch.randn(input_shape, device='cuda'),
model,
dummy_input,
file_name,
opset_version=10,
operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK,
......@@ -147,6 +148,10 @@ def export_torchvision_model(self, model_name, batch_size=1):
}
},
)
del model
del dummy_input
torch.cuda.empty_cache()
return file_name
def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
......@@ -163,13 +168,13 @@ def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
if not self.check_benchmark_model(model_name):
return
file_name = str(self._onnx_model_path / (model_name + '.onnx'))
input_shape, dtype = (batch_size, seq_length), torch.int64
model = self.benchmark_models[model_name]().eval().cuda()
dummy_input = torch.ones((batch_size, seq_length), dtype=torch.int64, device='cuda')
if model_name == 'lstm':
input_shape += (self.lstm_input_size, )
dtype = None
dummy_input = torch.ones((batch_size, seq_length, self.lstm_input_size), device='cuda')
torch.onnx.export(
self.benchmark_models[model_name]().eval().cuda(),
torch.ones(input_shape, dtype=dtype, device='cuda'),
model,
dummy_input,
file_name,
opset_version=10,
do_constant_folding=True,
......@@ -185,4 +190,8 @@ def export_benchmark_model(self, model_name, batch_size=1, seq_length=512):
}
},
)
del model
del dummy_input
torch.cuda.empty_cache()
return file_name
......@@ -291,8 +291,8 @@ def _process_raw_result(self, cmd_idx, raw_output):
raw_data = raw_data.split(',')
raw_data.pop()
raw_data = [float(item) for item in raw_data]
self._result.add_result(metric, statistics.mean(raw_data))
self._result.add_raw_data(metric, raw_data)
self._result.add_result(metric.lower() + '_time', statistics.mean(raw_data))
self._result.add_raw_data(metric.lower() + '_time', raw_data)
if 'Error' in line:
error = True
except BaseException as e:
......
......@@ -6,6 +6,7 @@
import os
import json
import yaml
import statistics
from superbench.common.utils import logger
from superbench.benchmarks import Platform, BenchmarkRegistry, ReturnCode
......@@ -424,8 +425,8 @@ def _process_raw_result(self, cmd_idx, raw_output):
raw_data = raw_data.split(',')
raw_data.pop()
raw_data = [float(item) for item in raw_data]
self._result.add_result(metric, sum(raw_data) / len(raw_data))
self._result.add_raw_data(metric, raw_data)
self._result.add_result(metric.lower() + '_time', statistics.mean(raw_data) * 1000)
self._result.add_raw_data(metric.lower() + '_time', raw_data)
if 'Error' in line:
error = True
except BaseException as e:
......
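The two `_process_raw_result` hunks above lowercase the metric name, append a `_time` suffix, and report the mean of the parsed samples. A minimal sketch of that post-processing follows, assuming the raw line is a comma-separated list of floats with a trailing comma (which is why the last field is dropped). The helper name, the `scale` parameter, and the sample metric name are hypothetical. The factor of 1000 in the second hunk is read here as a unit conversion to match the documented microsecond unit, which is an assumption.

```python
import statistics
from typing import List, Tuple


def summarize_timings(metric: str, raw_output: str, scale: float = 1.0) -> Tuple[str, float, List[float]]:
    """Return (unified metric name, scaled mean, raw samples) for one result line."""
    fields = raw_output.split(',')
    fields.pop()  # drop the empty field left by the trailing comma
    raw_data = [float(item) for item in fields]
    name = metric.lower() + '_time'
    return name, statistics.mean(raw_data) * scale, raw_data


# Hypothetical metric name and samples, for illustration only.
name, mean_time, samples = summarize_timings('Name_cublasSgemm_M_512', '1.0,2.0,3.0,')
_, scaled_mean, _ = summarize_timings('Name_cudnnConvolutionForward', '1.0,2.0,3.0,', scale=1000)
```

Only the summarized mean is scaled in this sketch, as in the hunk. The raw samples are stored unchanged.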
......@@ -249,7 +249,7 @@ def __prepare_general_ib_command_params(self):
msg_size = '-s ' + str(self._args.msg_size)
# Add GPUDirect for ib command
gpu_enable = ''
if self._args.gpu_index:
if self._args.gpu_index is not None:
gpu = GPU()
if gpu.vendor == 'nvidia':
gpu_enable = ' --use_cuda={gpu_index}'.format(gpu_index=str(self._args.gpu_index))
......
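The `gpu_index` hunk above replaces a truthiness test with an explicit `is not None` comparison. A minimal sketch of why that matters, assuming `gpu_index` is either an integer or `None` (the helper names are hypothetical and not part of SuperBench): index `0` is falsy in Python, so the old check would silently skip the first GPU.

```python
from typing import Optional


def gpu_flag_truthy(gpu_index: Optional[int]) -> str:
    """Old check: treats GPU index 0 the same as 'not set'."""
    if gpu_index:
        return ' --use_cuda={}'.format(gpu_index)
    return ''


def gpu_flag_explicit(gpu_index: Optional[int]) -> str:
    """New check: only None means 'not set', so index 0 is honoured."""
    if gpu_index is not None:
        return ' --use_cuda={}'.format(gpu_index)
    return ''


old_flag_for_gpu0 = gpu_flag_truthy(0)
new_flag_for_gpu0 = gpu_flag_explicit(0)
```

Both helpers agree for any non-zero index and for `None`. They differ only for index `0`.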
......@@ -3,7 +3,7 @@
# Server:
# - Product: HPE Apollo 6500
version: v0.3
version: v0.4
superbench:
enable: null
var:
......@@ -99,9 +99,31 @@ superbench:
copy_type:
- sm
- dma
ort-inference:
<<: *default_local_mode
ib-traffic:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
btl: tcp,self
pml: ob1
btl_tcp_if_include: ens17f0
gpcnet-network-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: ens17f0
tcp-connectivity:
enable: false
modes:
- name: local
parallel: no
parameters:
port: 22
ort-models:
enable: false
modes:
......
......@@ -4,7 +4,7 @@
# - Product: G482-Z53
# - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
version: v0.3
version: v0.4
superbench:
enable: null
var:
......
......@@ -3,9 +3,13 @@
# Azure NDm A100 v4
# reference: https://docs.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series
version: v0.3
version: v0.4
superbench:
enable: null
monitor:
enable: true
sample_duration: 1
sample_interval: 10
var:
default_local_mode: &default_local_mode
enable: true
......@@ -123,6 +127,52 @@ superbench:
<<: *default_pytorch_mode
computation-communication-overlap:
<<: *default_pytorch_mode
ib-traffic:
enable: false
modes:
- name: mpi
proc_num: 1
gpcnet-network-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: eth0
gpcnet-network-load-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: eth0
tcp-connectivity:
enable: false
modes:
- name: local
parallel: no
parameters:
port: 22
ort-inference:
<<: *default_local_mode
tensorrt-inference:
<<: *default_local_mode
parameters:
pytorch_models:
- resnet50
- resnet101
- resnet152
- densenet169
- densenet201
- bert-base
- bert-large
seq_length: 224
batch_size: 32
precision: int8
gpt_models:
<<: *default_pytorch_mode
models:
......
# SuperBench Config
version: v0.3
version: v0.4
superbench:
enable: null
monitor:
enable: false
enable: true
sample_duration: 1
sample_interval: 10
var:
......@@ -109,9 +109,52 @@ superbench:
<<: *default_pytorch_mode
computation-communication-overlap:
<<: *default_pytorch_mode
ib-traffic:
enable: false
modes:
- name: mpi
proc_num: 1
gpcnet-network-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: eth0
gpcnet-network-load-test:
enable: false
modes:
- name: mpi
proc_num: 1
mca:
pml: ucx
btl: ^uct
btl_tcp_if_include: eth0
tcp-connectivity:
enable: false
modes:
- name: local
parallel: no
parameters:
port: 22
ort-inference:
<<: *default_local_mode
enable: false
tensorrt-inference:
<<: *default_local_mode
parameters:
pytorch_models:
- resnet50
- resnet101
- resnet152
- densenet169
- densenet201
- bert-base
- bert-large
seq_length: 224
batch_size: 32
precision: int8
gpt_models:
<<: *default_pytorch_mode
models:
......