Unverified Commit 63e9b2d1 authored by Yifan Xiong, committed by GitHub

Release - SuperBench v0.6.0 (#409)



**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
parent 733860d7
......@@ -15,7 +15,7 @@
__SuperBench__ is a validation and profiling tool for AI infrastructure.
📢 [v0.5.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.5.0) has been released!
📢 [v0.6.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.6.0) has been released!
## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._
......
......@@ -26,6 +26,7 @@ RUN apt-get update && \
curl \
dmidecode \
git \
iproute2 \
jq \
libaio-dev \
libcap2 \
......@@ -38,6 +39,7 @@ RUN apt-get update && \
openssh-client \
openssh-server \
pciutils \
sudo \
util-linux \
vim \
wget \
......
......@@ -31,6 +31,7 @@ RUN apt-get update && \
dmidecode \
git \
hipify-clang \
iproute2 \
jq \
libaio-dev \
libboost-program-options-dev \
......@@ -46,6 +47,7 @@ RUN apt-get update && \
openssh-server \
pciutils \
rsync \
sudo \
util-linux \
vim \
wget \
......
......@@ -30,6 +30,7 @@ RUN apt-get update && \
dmidecode \
git \
hipify-clang \
iproute2 \
jq \
libaio-dev \
libboost-program-options-dev \
......@@ -46,6 +47,7 @@ RUN apt-get update && \
openssh-server \
pciutils \
rsync \
sudo \
util-linux \
vim \
wget \
......
......@@ -180,16 +180,16 @@ sb result diagnosis --baseline-file
#### Required arguments
| Name | Description |
|------------------------|------------------------|
| `--baseline-file` `-b` | Path to baseline file. |
| `--data-file` `-d` | Path to raw data file. |
| `--rule-file` `-r` | Path to rule file. |
| Name | Description |
|--------------------|------------------------|
| `--data-file` `-d` | Path to raw data file. |
| `--rule-file` `-r` | Path to rule file. |
#### Optional arguments
| Name | Default | Description |
|-------------------------|---------|-----------------------------------------------------------------------------|
| `--baseline-file` `-b` | `None` | Path to baseline file. |
| `--decimal-place-value` | 2 | Number of valid decimal places to show in output. |
| `--output-all` | N/A | Output diagnosis results for all nodes. |
| `--output-dir` | `None` | Path to output directory, outputs/{datetime} will be used if not specified. |
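For example, a typical invocation looks like the following (file paths are illustrative):

```bash
# baseline file is optional since v0.6.0
sb result diagnosis --data-file outputs/results-summary.jsonl --rule-file rules.yaml
# with an explicit baseline
sb result diagnosis -d outputs/results-summary.jsonl -r rules.yaml -b baseline.json
```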
......
......@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
:::note Note
You should check out the corresponding tag to use a release version, for example,
`git clone -b v0.5.0 https://github.com/microsoft/superbenchmark`
`git clone -b v0.6.0 https://github.com/microsoft/superbenchmark`
:::
```bash
......@@ -96,7 +96,7 @@ Here're the system requirements for all managed GPU nodes.
* Latest version of Linux; you're highly encouraged to use Ubuntu 18.04 or later.
* Compatible GPU drivers should be installed correctly. Driver version can be checked by running `nvidia-smi`.
* [Docker CE](https://docs.docker.com/engine/install/) version 19.03 or later (which can be checked by running `docker --version`).
* [Docker CE](https://docs.docker.com/engine/install/) version 20.10 or later (which can be checked by running `docker --version`).
* NVIDIA GPU support in Docker, install
[nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit).
......@@ -106,7 +106,7 @@ Here're the system requirements for all managed GPU nodes.
* Latest version of Linux; you're highly encouraged to use Ubuntu 18.04 or later.
* Compatible GPU drivers should be installed correctly, and group permission should be set to access GPU resources.
You should be able to run `rocm-smi` and `rocminfo` directly to check GPU usage and information.
* [Docker CE](https://docs.docker.com/engine/install/) version 19.03 or later (which can be checked by running `docker --version`).
* [Docker CE](https://docs.docker.com/engine/install/) version 20.10 or later (which can be checked by running `docker --version`).
</TabItem>
</Tabs>
......@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
:::note Note
You should deploy the corresponding Docker image to use a release version, for example,
`sb deploy -f local.ini -i superbench/superbench:v0.5.0-cuda11.1.1`
`sb deploy -f local.ini -i superbench/superbench:v0.6.0-cuda11.1.1`
Note that the version of the git repo only determines the version of the sb CLI, not the sb container. You should specify the container version explicitly even if you cloned a release tag.
......
......@@ -70,7 +70,7 @@ superbench:
<TabItem value='example'>
```yaml
version: v0.5
version: v0.6
superbench:
  enable: benchmark_1
  monitor:
......
......@@ -29,6 +29,7 @@ available tags are listed below for all stable versions.
| Tag | Description |
|-------------------|------------------------------------|
| v0.6.0-cuda11.1.1 | SuperBench v0.6.0 with CUDA 11.1.1 |
| v0.5.0-cuda11.1.1 | SuperBench v0.5.0 with CUDA 11.1.1 |
| v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
| v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
......@@ -40,6 +41,10 @@ available tags are listed below for all stable versions.
| Tag | Description |
|-------------------------------|--------------------------------------------------|
| v0.6.0-rocm5.1.3 | SuperBench v0.6.0 with ROCm 5.1.3 |
| v0.6.0-rocm5.1.1 | SuperBench v0.6.0 with ROCm 5.1.1 |
| v0.6.0-rocm5.0.1 | SuperBench v0.6.0 with ROCm 5.0.1 |
| v0.6.0-rocm5.0 | SuperBench v0.6.0 with ROCm 5.0 |
| v0.5.0-rocm5.0.1-pytorch1.9.0 | SuperBench v0.5.0 with ROCm 5.0.1, PyTorch 1.9.0 |
| v0.5.0-rocm5.0-pytorch1.9.0 | SuperBench v0.5.0 with ROCm 5.0, PyTorch 1.9.0 |
| v0.5.0-rocm4.2-pytorch1.7.0 | SuperBench v0.5.0 with ROCm 4.2, PyTorch 1.7.0 |
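To use one of these images, pull it by tag, for example:

```bash
docker pull superbench/superbench:v0.6.0-cuda11.1.1
```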
......
......@@ -32,7 +32,7 @@ The input mainly includes 3 files:
- **rule file**: It uses YAML format and includes each metric's rules to filter defective machines for diagnosis.
- **baseline file**: json file including the baseline values for the metrics.
- **baseline file (optional)**: json file including the baseline values for the metrics.
`Tips`: this file for some representative machine types will be published in [SuperBench Results Repo](https://github.com/microsoft/superbench-results/tree/main) with each SuperBench release.
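For illustration, a baseline file is a flat json object mapping metric names to expected values (the metric names below are only examples):

```json
{
  "kernel-launch/event_overhead:0": 0.0055,
  "kernel-launch/wall_overhead:0": 0.045
}
```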
......@@ -52,8 +52,8 @@ superbench:
  ${var_name}: dict
  rules:
    ${rule_name}:
      function: string
      criteria: string
      function: (optional)string
      criteria: (optional)string
      store: (optional)bool
      categories: string
      metrics:
......@@ -65,11 +65,11 @@ superbench:
example:
```yaml
# SuperBench rules
version: v0.5
version: v0.6
superbench:
  rules:
    failure-rule:
      function: value
      function: failure_check
      criteria: lambda x:x>0
      categories: Failed
      metrics:
......@@ -125,8 +125,17 @@ superbench:
        - vgg_models/pytorch-vgg.*/throughput_train_.*\
    rule6:
      function: multi_rules
      criteria: 'lambda label:True if label["rule4"]+label["rule5"]>=2 else False'
      criteria: 'lambda label: bool(label["rule4"]+label["rule5"]>=2)'
      categories: CNN
    rule7:
      categories: MODEL_DIST
      store: True
      metrics:
        - model-benchmarks:stress-run.*/pytorch-gpt2-large/fp32_train_throughput
    rule8:
      function: multi_rules
      criteria: 'lambda label: bool(min(label["rule7"].values()) < 1)'
      categories: MODEL_DIST
```
This rule file describes the rules used for data diagnosis.
......@@ -147,15 +156,18 @@ The criterion used for this rule, which indicates how to compare the data with t
#### `store`
True if the current rule is not used alone to filter the defective machine, but will be used by other subsequent rules. False (default) if this rule is used to label the defective machine directly.
- True: this rule is used to store metrics which will be used by subsequent rules (see the sketch below).
  - If store is True and criteria/function are not None in the rule, it stores how many metrics in this rule meet the criteria into label["rule_name"]; for example, label["rule_name"]=2 means 2 metrics are identified as defective by this rule.
  - If store is True and criteria/function are None, it stores the dict of {metric_name: value} for this rule's metrics into label["rule_name"].
- False (default): this rule is used to label the defective machine directly.
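A minimal sketch in plain Python (hypothetical values, not the analyzer's actual code) of how stored labels feed a `multi_rules` criteria:

```python
# labels accumulated while evaluating earlier rules
label = {
    'rule4': 1,    # 1 metric in rule4 met its criteria
    'rule5': 1,    # 1 metric in rule5 met its criteria
    # store: True with no criteria/function keeps raw {metric: value} pairs
    'rule7': {'model-benchmarks:stress-run.0/pytorch-gpt2-large/fp32_train_throughput': 0.8},
}

# criteria strings from the rule file are evaluated as lambdas over the labels
rule6 = eval('lambda label: bool(label["rule4"] + label["rule5"] >= 2)')
rule8 = eval('lambda label: bool(min(label["rule7"].values()) < 1)')
print(rule6(label))    # True -> defective under rule6
print(rule8(label))    # True -> defective under rule8
```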
#### `function`
The function used for this rule.
3 types of rules are supported currently:
The supported functions are listed as follows:
- `variance`: the rule is to check if the variance between raw data and baseline violates the criteria. variance = (raw data - criteria) / criteria
- `variance`: the rule is to check if the variance between raw data and baseline violates the criteria. variance = (raw data - baseline) / baseline
For example, if the 'criteria' is `lambda x:x>0.05`, the rule is that if the variance is larger than 5%, it should be defective.
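For instance, with a baseline of 100 and a raw value of 90, variance = (90 - 100) / 100 = -0.10, so a criteria of `lambda x:x<-0.05` would flag the 10% regression.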
......@@ -164,8 +176,16 @@ The function used for this rule.
For example, if the 'criteria' is `lambda x:x>0`, the rule is that if the raw data is larger than 0, the machine is identified as defective.
- `multi_rules`: the rule is to check if the combined results of multiple previous rules and metrics violate the criteria.
Several examples:
  - `criteria: lambda label: bool(label["rule4"]+label["rule5"]>=2)` means this rule is triggered if the sum of labeled metrics in rule4 and rule5 is greater than or equal to 2.
  - `criteria: lambda label: bool(min(label["rule7"].values()) < 1)` means that if the minimum of the metric values stored by rule7 is smaller than 1, the machine is identified as defective.
  - If you reference a non-existent rule, an exception is raised.
  - If the test in a referenced rule failed or was not run, causing an exception in the criteria, no exception is raised, since such cases are caught by the failure rule.
For example, if the 'criteria' is 'lambda label:True if label["rule4"]+label["rule5"]>=2 else False', the rule is that if the sum of labeled metrics in rule4 and rule5 is larger than 2, it should be defective.
- `failure_check`: the rule is to check if any metric in this rule fails or misses the test. The metrics in this rule should look like `{benchmark_name}/.*:return_code`, which is used to identify failures.
  - If any item never matches a metric in the raw data, the rule identifies it as a missed test.
  - If any metric violates the `value` criteria, meaning the return_code is not success (0), the rule identifies it as a failed test.
`Tips`: you must include a default rule for ${benchmark_name}/return_code, as in the example above, which is used to identify failed tests. A sketch of such a rule follows.
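A sketch of such a default failure rule, mirroring the `failure-rule` shown earlier (the benchmark name is illustrative):

```yaml
failure-rule:
  function: failure_check
  criteria: 'lambda x: x > 0'    # non-zero return_code means the test failed
  categories: Failed
  metrics:
    - kernel-launch/return_code
```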
......@@ -182,6 +202,8 @@ The output includes all defective machines' information including index, failure
- Defective Details (diagnosis/issue_details in json format): all violated metrics including metric data and related rule.
- ${metric}: the data of the metrics defined in the rule file. If the rule is `variance`, the form of the data is variance in percentage; if the rule is `value`, the form of the data is raw data.
- `'N/A'` indicates an empty value for the metric in output files.
If you specify '--output-all' in the command, the output includes all machines' information and an extra field indicating whether each machine is defective.
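As a rough illustration of the jsonl format described above (field contents are hypothetical and the details string depends on the rule), one output line resembles:

```json
{"Category": "KernelLaunch", "Defective Details": "kernel-launch/event_overhead:0(VAR: 23.33% Rule:lambda x:x>0.05)", "index": "node-001"}
```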
......
......@@ -58,7 +58,7 @@ superbench:
```yaml title="Example"
# SuperBench rules
version: v0.5
version: v0.6
superbench:
  rules:
    kernel_launch:
......
......@@ -142,7 +142,7 @@ def run(self):
install_requires=[
'ansible_base>=2.10.9;os_name=="posix"',
'ansible_runner>=2.0.0rc1',
'colorlog>=4.7.2',
'colorlog>=6.7.0',
'jinja2>=2.10.1',
'joblib>=1.0.1',
'jsonlines>=2.0.0',
......@@ -155,6 +155,7 @@ def run(self):
'omegaconf==2.0.6',
'openpyxl>=3.0.7',
'pandas>=1.1.5',
'pssh @ git+https://github.com/lilydjwg/pssh.git@v2.3.4',
'pyyaml>=5.3',
'requests>=2.27.1',
'seaborn>=0.11.2',
......@@ -169,8 +170,8 @@ def run(self):
**x,
'develop': x['dev'] + x['test'],
'cpuworker': x['torch'],
'amdworker': x['torch'] + x['ort'] + x['mpi'],
'nvworker': x['torch'] + x['ort'] + x['mpi'] + x['nvidia'],
'amdworker': x['torch'] + x['ort'],
'nvworker': x['torch'] + x['ort'] + x['nvidia'],
}
)(
{
......@@ -199,7 +200,6 @@ def run(self):
'onnx>=1.10.2',
'onnxruntime-gpu==1.10.0',
],
'mpi': ['mpi4py>=3.1.3'],
'nvidia': ['py3nvml>=0.2.6'],
}
),
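For reference, the extras above translate to install targets such as (a sketch, run from the repo root):

```bash
pip install -e .[develop]    # dev + test tooling
pip install .[nvworker]      # torch + ort + nvidia (the mpi extra is removed in v0.6.0)
```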
......
......@@ -6,5 +6,5 @@
Provide hardware and software benchmarks for AI systems.
"""
__version__ = '0.5.0'
__version__ = '0.6.0'
__author__ = 'Microsoft'
......@@ -21,6 +21,7 @@ class DataDiagnosis(RuleBase):
def __init__(self):
"""Init function."""
super().__init__()
self.na = 'N/A'
def _check_and_format_rules(self, rule, name):
"""Check the rule of the metric whether the formart is valid.
......@@ -63,8 +64,6 @@ def _get_baseline_of_metric(self, baseline, metric):
"""
if metric in baseline:
return baseline[metric]
elif 'return_code' in metric:
return 0
else:
short = metric
# exclude rank info, for example, '.*:\d+'->'.*'
......@@ -76,8 +75,7 @@ def _get_baseline_of_metric(self, baseline, metric):
return baseline[short]
# baseline not defined
else:
logger.warning('DataDiagnosis: get baseline - {} baseline not found'.format(metric))
return -1
return None
def __get_metrics_and_baseline(self, rule, benchmark_rules, baseline):
"""Get metrics with baseline in the rule.
......@@ -108,8 +106,7 @@ def _parse_rules_and_baseline(self, rules, baseline):
"""
try:
if not rules:
logger.error('DataDiagnosis: get criteria failed')
return False
logger.log_and_raise(exception=Exception, msg='DataDiagnosis: get criteria failed')
self._sb_rules = {}
self._enable_metrics = set()
benchmark_rules = rules['superbench']['rules']
......@@ -129,8 +126,7 @@ def _parse_rules_and_baseline(self, rules, baseline):
self.__get_metrics_and_baseline(rule, benchmark_rules, baseline)
self._enable_metrics = sorted(list(self._enable_metrics))
except Exception as e:
logger.error('DataDiagnosis: get criteria failed - {}'.format(str(e)))
return False
logger.log_and_raise(exception=Exception, msg='DataDiagnosis: get criteria failed - {}'.format(str(e)))
return True
......@@ -205,32 +201,29 @@ def run_diagnosis_rules(self, rules, baseline):
data_not_accept_df (DataFrame): defective nodes' detailed information
label_df (DataFrame): labels for all nodes
"""
try:
summary_columns = ['Category', 'Defective Details']
data_not_accept_df = pd.DataFrame(columns=summary_columns)
summary_details_df = pd.DataFrame()
label_df = pd.DataFrame(columns=['label'])
if not self._parse_rules_and_baseline(rules, baseline):
return data_not_accept_df, label_df
# run diagnosis rules for each node
for node in self._raw_data_df.index:
details_row, summary_data_row = self._run_diagnosis_rules_for_single_node(node)
if details_row:
data_not_accept_df.loc[node] = details_row
summary_details_df = pd.concat(
[summary_details_df,
pd.DataFrame([summary_data_row.to_dict()], index=[summary_data_row.name])]
)
label_df.loc[node] = 1
else:
label_df.loc[node] = 0
# combine details for defective nodes
if len(data_not_accept_df) != 0:
data_not_accept_df = data_not_accept_df.join(summary_details_df)
data_not_accept_df = data_not_accept_df.sort_values(by=summary_columns, ascending=False)
summary_columns = ['Category', 'Defective Details']
data_not_accept_df = pd.DataFrame(columns=summary_columns)
summary_details_df = pd.DataFrame()
label_df = pd.DataFrame(columns=['label'])
if not self._parse_rules_and_baseline(rules, baseline):
return data_not_accept_df, label_df
# run diagnosis rules for each node
for node in self._raw_data_df.index:
details_row, summary_data_row = self._run_diagnosis_rules_for_single_node(node)
if details_row:
data_not_accept_df.loc[node] = details_row
summary_details_df = pd.concat(
[summary_details_df,
pd.DataFrame([summary_data_row.to_dict()], index=[summary_data_row.name])]
)
label_df.loc[node] = 1
else:
label_df.loc[node] = 0
# combine details for defective nodes
if len(data_not_accept_df) != 0:
data_not_accept_df = data_not_accept_df.join(summary_details_df)
data_not_accept_df = data_not_accept_df.sort_values(by=summary_columns, ascending=False)
except Exception as e:
logger.error('DataDiagnosis: run diagnosis rules failed, message: {}'.format(str(e)))
return data_not_accept_df, label_df
def output_all_nodes_results(self, raw_data_df, data_not_accept_df):
......@@ -258,24 +251,21 @@ def output_all_nodes_results(self, raw_data_df, data_not_accept_df):
data_not_accept_df['Number Of Issues'] = data_not_accept_df['Defective Details'].map(
lambda x: len(x.split(','))
)
for index in range(len(append_columns)):
for index in range(len(append_columns) - 1, -1, -1):
if append_columns[index] not in data_not_accept_df:
logger.warning(
'DataDiagnosis: output_all_nodes_results - column {} not found in data_not_accept_df.'.format(
append_columns[index]
)
logger.log_and_raise(
Exception,
msg='DataDiagnosis: output_all_nodes_results - column {} not found in data_not_accept_df.'.
format(append_columns[index])
)
all_data_df[append_columns[index]] = None
else:
all_data_df = all_data_df.merge(
data_not_accept_df[[append_columns[index]]], left_index=True, right_index=True, how='left'
)
all_data_df = data_not_accept_df[[
append_columns[index]
]].merge(all_data_df, left_index=True, right_index=True, how='right')
all_data_df['Accept'] = all_data_df['Accept'].replace(np.nan, True)
all_data_df['Number Of Issues'] = all_data_df['Number Of Issues'].replace(np.nan, 0)
all_data_df['Number Of Issues'] = all_data_df['Number Of Issues'].astype(int)
all_data_df = all_data_df.replace(np.nan, '')
return all_data_df
def output_diagnosis_in_excel(self, raw_data_df, data_not_accept_df, output_path, rules):
......@@ -288,16 +278,16 @@ def output_diagnosis_in_excel(self, raw_data_df, data_not_accept_df, output_path
rules (dict): the rules of DataDiagnosis
"""
try:
data_not_accept_df = data_not_accept_df.convert_dtypes()
writer = pd.ExcelWriter(output_path, engine='xlsxwriter')
# Check whether writer is valid
if not isinstance(writer, pd.ExcelWriter):
logger.error('DataDiagnosis: excel_data_output - invalid file path.')
return
logger.log_and_raise(exception=IOError, msg='DataDiagnosis: excel_data_output - invalid file path.')
file_handler.output_excel_raw_data(writer, raw_data_df, 'Raw Data')
file_handler.output_excel_data_not_accept(writer, data_not_accept_df, rules)
writer.save()
except Exception as e:
logger.error('DataDiagnosis: excel_data_output - {}'.format(str(e)))
logger.log_and_raise(exception=Exception, msg='DataDiagnosis: excel_data_output - {}'.format(str(e)))
def output_diagnosis_in_jsonl(self, data_not_accept_df, output_path):
"""Output data_not_accept_df into jsonl file.
......@@ -306,24 +296,29 @@ def output_diagnosis_in_jsonl(self, data_not_accept_df, output_path):
data_not_accept_df (DataFrame): the DataFrame to output
output_path (str): the path of output jsonl file
"""
data_not_accept_df = data_not_accept_df.convert_dtypes().astype('object').fillna(self.na)
p = Path(output_path)
try:
data_not_accept_json = data_not_accept_df.to_json(orient='index')
data_not_accept = json.loads(data_not_accept_json)
if not isinstance(data_not_accept_df, pd.DataFrame):
logger.warning('DataDiagnosis: output json data - data_not_accept_df is not DataFrame.')
return
logger.log_and_raise(
Exception, msg='DataDiagnosis: output json data - data_not_accept_df is not DataFrame.'
)
if data_not_accept_df.empty:
logger.warning('DataDiagnosis: output json data - data_not_accept_df is empty.')
with p.open('w') as f:
pass
return
with p.open('w') as f:
for node in data_not_accept:
line = data_not_accept[node]
line['Index'] = node
line['index'] = node
json_str = json.dumps(line)
f.write(json_str + '\n')
except Exception as e:
logger.error('DataDiagnosis: output json data failed, msg: {}'.format(str(e)))
logger.log_and_raise(
exception=Exception, msg='DataDiagnosis: output json data failed, msg: {}'.format(str(e))
)
def output_diagnosis_in_json(self, data_not_accept_df, output_path):
"""Output data_not_accept_df into json file.
......@@ -332,7 +327,8 @@ def output_diagnosis_in_json(self, data_not_accept_df, output_path):
data_not_accept_df (DataFrame): the DataFrame to output
output_path (str): the path of output jsonl file
"""
data_not_accept_df['Index'] = data_not_accept_df.index
data_not_accept_df = data_not_accept_df.convert_dtypes().astype('object').fillna(self.na)
data_not_accept_df = data_not_accept_df.reset_index()
data_not_accept_df = data_not_accept_df.rename(
columns={
'Defective Details': 'diagnosis/issue_details',
......@@ -358,29 +354,31 @@ def generate_md_lines(self, data_not_accept_df, rules, round):
Returns:
list: lines in markdown format
"""
data_not_accept_df['machine'] = data_not_accept_df.index
if len(data_not_accept_df) == 0:
return []
data_not_accept_df = data_not_accept_df.reset_index()
header = data_not_accept_df.columns.tolist()
header = header[-1:] + header[:-1]
data_not_accept_df = data_not_accept_df[header]
# format precision of values to n decimal digits
for rule in rules:
for metric in rules[rule]['metrics']:
if rules[rule]['function'] == 'variance':
if round and isinstance(round, int):
if 'function' in rules[rule]:
for metric in rules[rule]['metrics']:
if rules[rule]['function'] == 'variance':
if round and isinstance(round, int):
data_not_accept_df[metric] = data_not_accept_df[metric].map(
lambda x: x * 100, na_action='ignore'
)
data_not_accept_df = data_analysis.round_significant_decimal_places(
data_not_accept_df, round, [metric]
)
data_not_accept_df[metric] = data_not_accept_df[metric].map(
lambda x: x * 100, na_action='ignore'
)
data_not_accept_df = data_analysis.round_significant_decimal_places(
data_not_accept_df, round, [metric]
)
data_not_accept_df[metric] = data_not_accept_df[metric].map(
lambda x: '{}%'.format(x), na_action='ignore'
)
elif rules[rule]['function'] == 'value':
if round and isinstance(round, int):
data_not_accept_df = data_analysis.round_significant_decimal_places(
data_not_accept_df, round, [metric]
lambda x: '{}%'.format(x), na_action='ignore'
)
elif rules[rule]['function'] == 'value':
if round and isinstance(round, int):
data_not_accept_df = data_analysis.round_significant_decimal_places(
data_not_accept_df, round, [metric]
)
data_not_accept_df = data_not_accept_df.convert_dtypes().astype('object').fillna(self.na)
lines = file_handler.generate_md_table(data_not_accept_df, header)
return lines
......@@ -401,7 +399,7 @@ def run(
try:
rules = self._preprocess(raw_data_file, rule_file)
# read baseline
baseline = file_handler.read_baseline(baseline_file)
baseline = file_handler.read_baseline(baseline_file) if baseline_file is not None else {}
logger.info('DataDiagnosis: Begin to process {} nodes'.format(len(self._raw_data_df)))
output_df, label_df = self.run_diagnosis_rules(rules, baseline)
logger.info('DataDiagnosis: Processed finished')
......@@ -424,7 +422,9 @@ def run(
else:
file_handler.output_lines_in_html(lines, output_path)
else:
logger.error('DataDiagnosis: output failed - unsupported output format')
logger.log_and_raise(
exception=Exception, msg='DataDiagnosis: output failed - unsupported output format'
)
logger.info('DataDiagnosis: Output results to {}'.format(output_path))
except Exception as e:
logger.error('DataDiagnosis: run failed - {}'.format(str(e)))
logger.log_and_raise(exception=Exception, msg='DataDiagnosis: run failed - {}'.format(str(e)))
......@@ -66,7 +66,7 @@ def check_criterion_with_a_value(rule):
"""
# parse criteria and check if valid
if not isinstance(eval(rule['criteria'])(0), bool):
logger.log_and_raise(exception=Exception, msg='invalid criteria format')
logger.log_and_raise(exception=ValueError, msg='invalid criteria format')
@staticmethod
def miss_test(metric, rule, data_row, details, categories):
......@@ -130,8 +130,10 @@ def variance(data_row, rule, summary_data_row, details, categories):
# check if metric pass the rule
val = data_row[metric]
baseline = rule['metrics'][metric]
if baseline == 0:
logger.log_and_raise(exception=Exception, msg='invalid baseline 0 in variance rule')
if baseline is None or baseline == 0:
logger.log_and_raise(
exception=ValueError, msg='invalid baseline 0 or baseline not found in variance rule'
)
var = (val - baseline) / baseline
summary_data_row[metric] = var
violate_metric = eval(rule['criteria'])(var)
......@@ -203,13 +205,20 @@ def multi_rules(rule, details, categories, store_values):
Returns:
number: 0 if the rule is passed, otherwise 1
"""
violated = eval(rule['criteria'])(store_values)
if not isinstance(violated, bool):
logger.log_and_raise(exception=Exception, msg='invalid upper criteria format')
if violated:
info = '{}:{}'.format(rule['name'], rule['criteria'])
RuleOp.add_categories_and_details(info, rule['categories'], details, categories)
return 1 if violated else 0
try:
violated = eval(rule['criteria'])(store_values)
if not isinstance(violated, bool):
logger.log_and_raise(exception=ValueError, msg='invalid criteria format')
if violated:
info = '{}:{}'.format(rule['name'], rule['criteria'])
RuleOp.add_categories_and_details(info, rule['categories'], details, categories)
return 1 if violated else 0
# the key defined in criteria is not found
except KeyError as e:
logger.log_and_raise(exception=KeyError, msg='invalid criteria format - {}'.format(str(e)))
# miss/failed test
except Exception:
return 0
@staticmethod
def failure_check(data_row, rule, summary_data_row, details, categories, raw_rule):
......
......@@ -28,8 +28,9 @@ def read_raw_data(raw_data_path):
p = Path(raw_data_path)
raw_data_df = pd.DataFrame()
if not p.is_file():
logger.error('FileHandler: invalid raw data path - {}'.format(raw_data_path))
return raw_data_df
logger.log_and_raise(
exception=FileNotFoundError, msg='FileHandler: invalid raw data path - {}'.format(raw_data_path)
)
try:
with p.open(encoding='utf-8') as f:
......@@ -38,7 +39,7 @@ def read_raw_data(raw_data_path):
raw_data_df = raw_data_df.rename(raw_data_df['node'])
raw_data_df = raw_data_df.drop(columns=['node'])
except Exception as e:
logger.error('Analyzer: invalid raw data format - {}'.format(str(e)))
logger.log_and_raise(exception=IOError, msg='Analyzer: invalid raw data format - {}'.format(str(e)))
return raw_data_df
......@@ -54,8 +55,9 @@ def read_rules(rule_file=None):
default_rule_file = Path(__file__).parent / 'rule/default_rule.yaml'
p = Path(rule_file) if rule_file else default_rule_file
if not p.is_file():
logger.error('FileHandler: invalid rule file path - {}'.format(str(p.resolve())))
return None
logger.log_and_raise(
exception=FileNotFoundError, msg='FileHandler: invalid rule file path - {}'.format(str(p.resolve()))
)
baseline = None
with p.open() as f:
baseline = yaml.load(f, Loader=yaml.SafeLoader)
......@@ -73,8 +75,9 @@ def read_baseline(baseline_file):
"""
p = Path(baseline_file)
if not p.is_file():
logger.error('FileHandler: invalid baseline file path - {}'.format(str(p.resolve())))
return None
logger.log_and_raise(
exception=FileNotFoundError, msg='FileHandler: invalid baseline file path - {}'.format(str(p.resolve()))
)
baseline = None
with p.open() as f:
baseline = json.load(f)
......@@ -119,45 +122,46 @@ def output_excel_data_not_accept(writer, data_not_accept_df, rules):
worksheet = writer.sheets['Not Accept']
for rule in rules:
for metric in rules[rule]['metrics']:
# The column index of the metrics should start from 1
col_index = columns.index(metric) + 1
# Apply percent format for the columns whose rules are variance type.
if rules[rule]['function'] == 'variance':
worksheet.conditional_format(
row_start,
col_index,
row_end,
col_index, # start_row, start_col, end_row, end_col
{
'type': 'no_blanks',
'format': percent_format
}
)
# Apply red format if the value violates the rule.
if rules[rule]['function'] == 'value' or rules[rule]['function'] == 'variance':
match = re.search(r'(>|<|<=|>=|==|!=)(.+)', rules[rule]['criteria'])
if not match:
continue
symbol = match.group(1)
condition = float(match.group(2))
worksheet.conditional_format(
row_start,
col_index,
row_end,
col_index, # start_row, start_col, end_row, end_col
{
'type': 'cell',
'criteria': symbol,
'value': condition,
'format': color_format_red
}
)
if 'function' in rules[rule]:
for metric in rules[rule]['metrics']:
# The column index of the metrics should start from 1
col_index = columns.index(metric) + 1
# Apply percent format for the columns whose rules are variance type.
if rules[rule]['function'] == 'variance':
worksheet.conditional_format(
row_start,
col_index,
row_end,
col_index, # start_row, start_col, end_row, end_col
{
'type': 'no_blanks',
'format': percent_format
}
)
# Apply red format if the value violates the rule.
if rules[rule]['function'] == 'value' or rules[rule]['function'] == 'variance':
match = re.search(r'(>|<|<=|>=|==|!=)(.+)', rules[rule]['criteria'])
if not match:
continue
symbol = match.group(1)
condition = float(match.group(2))
worksheet.conditional_format(
row_start,
col_index,
row_end,
col_index, # start_row, start_col, end_row, end_col
{
'type': 'cell',
'criteria': symbol,
'value': condition,
'format': color_format_red
}
)
else:
logger.warning('FileHandler: excel_data_output - data_not_accept_df is empty.')
else:
logger.warning('FileHandler: excel_data_output - data_not_accept_df is not DataFrame.')
logger.log_and_raise(RuntimeError, msg='FileHandler: excel_data_output - data_not_accept_df is not DataFrame.')
def generate_md_table(data_df, header):
......@@ -198,12 +202,11 @@ def output_lines_in_md(lines, output_path):
"""
try:
if len(lines) == 0:
logger.error('FileHandler: md_data_output failed')
return
logger.warning('FileHandler: md_data_output is empty')
with open(output_path, 'w') as f:
f.writelines(lines)
except Exception as e:
logger.error('FileHandler: md_data_output - {}'.format(str(e)))
logger.log_and_raise(exception=IOError, msg='FileHandler: md_data_output - {}'.format(str(e)))
def output_lines_in_html(lines, output_path):
......@@ -215,14 +218,13 @@ def output_lines_in_html(lines, output_path):
"""
try:
if len(lines) == 0:
logger.error('FileHandler: html_data_output failed')
return
logger.warning('FileHandler: html_data_output is empty')
lines = ''.join(lines)
html_str = markdown.markdown(lines, extensions=['markdown.extensions.tables'])
with open(output_path, 'w') as f:
f.writelines(html_str)
except Exception as e:
logger.error('FileHandler: html_data_output - {}'.format(str(e)))
logger.log_and_raise(exception=IOError, msg='FileHandler: html_data_output - {}'.format(str(e)))
def merge_column_in_excel(ws, row, column):
......
......@@ -103,8 +103,7 @@ def _preprocess(self, raw_data_file, rule_file):
self._benchmark_metrics_dict = self._get_metrics_by_benchmarks(list(self._raw_data_df.columns))
# check raw data whether empty
if len(self._raw_data_df) == 0:
logger.error('RuleBase: empty raw data')
return None
logger.log_and_raise(exception=Exception, msg='RuleBase: empty raw data')
# read rules
rules = file_handler.read_rules(rule_file)
return rules
......@@ -3,6 +3,7 @@
"""Module of the base class."""
import shlex
import signal
import traceback
import argparse
......@@ -39,7 +40,7 @@ def __init__(self, name, parameters=''):
parameters (str): benchmark parameters.
"""
self._name = name
self._argv = list(filter(None, parameters.split(' '))) if parameters is not None else list()
self._argv = list(filter(None, shlex.split(parameters))) if parameters is not None else list()
self._benchmark_type = None
self._parser = argparse.ArgumentParser(
add_help=False,
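The switch from `parameters.split(' ')` to `shlex.split(parameters)` is what allows spaces inside quoted parameter values (#397). A quick illustration:

```python
import shlex

params = '--name "a b" --num 1'
print(params.split(' '))    # ['--name', '"a', 'b"', '--num', '1'] -- quoted value broken apart
print(shlex.split(params))  # ['--name', 'a b', '--num', '1'] -- quotes respected
```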
......@@ -170,10 +171,11 @@ def run(self):
except BaseException as e:
self._result.set_return_code(ReturnCode.RUNTIME_EXCEPTION_ERROR)
logger.error('Run benchmark failed - benchmark: {}, message: {}'.format(self._name, str(e)))
else:
ret &= self._postprocess()
finally:
self._end_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
self._result.set_timestamp(self._start_time, self._end_time)
ret &= self._postprocess()
return ret
......
......@@ -254,7 +254,7 @@ def __prepare_config(self):
if not self._args.hostfile:
self._args.hostfile = os.path.join(os.environ.get('SB_WORKSPACE', '.'), 'hostfile')
with open(self._args.hostfile, 'r') as f:
hosts = f.readlines()
hosts = f.read().splitlines()
# Generate the config file if not define
if self._args.config is None:
self.gen_traffic_pattern(hosts, self._args.pattern, self.__config_path)
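Unlike `readlines()`, `read().splitlines()` drops the trailing newline from each host name, so entries compare cleanly later. For example:

```python
text = 'node0\nnode1\n'
print(text.splitlines())    # ['node0', 'node1']
# f.readlines() on the same content would give ['node0\n', 'node1\n']
```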
......@@ -297,15 +297,18 @@ def __prepare_general_ib_command_params(self):
# Add GPUDirect for ib command
gpu_dev = ''
if self._args.gpu_dev is not None:
gpu = GPU()
if gpu.vendor == 'nvidia':
gpu_dev = f'--use_cuda={self._args.gpu_dev}'
elif gpu.vendor == 'amd':
gpu_dev = f'--use_rocm={self._args.gpu_dev}'
else:
self._result.set_return_code(ReturnCode.INVALID_ARGUMENT)
logger.error('No GPU found - benchmark: {}'.format(self._name))
return False
if 'bw' in self._args.command:
gpu = GPU()
if gpu.vendor == 'nvidia':
gpu_dev = f'--use_cuda={self._args.gpu_dev}'
elif gpu.vendor == 'amd':
gpu_dev = f'--use_rocm={self._args.gpu_dev}'
else:
self._result.set_return_code(ReturnCode.INVALID_ARGUMENT)
logger.error('No GPU found - benchmark: {}'.format(self._name))
return False
elif 'lat' in self._args.command:
logger.warning('Wrong configuration: Perftest supports CUDA/ROCM only in BW tests')
# Generate ib command params
command_params = f'-F -n {self._args.iters} -d {self._args.ib_dev} {msg_size} {gpu_dev}'
command_params = f'{command_params.strip()} --report_gbits'
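For a bandwidth command on an NVIDIA node (illustrative values: 2000 iterations, device mlx5_0, GPU 0), this yields params along the lines of `-F -n 2000 -d mlx5_0 --use_cuda=0 --report_gbits` plus the message-size flags; for latency commands the GPU flag is omitted with a warning, since perftest supports CUDA/ROCm only in bandwidth tests.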
......
......@@ -260,13 +260,28 @@ void gather_hostnames(vector<string> &hostnames, string filename) {
}
// Parse raw output of ib command
// TODO: does not work for latency tests
// Sample of ib bw command raw output
// #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
// 8388608 5000 196.08 195.76 0.002917
// Sample of ib latency command raw output
// #bytes #iterations t_min t_max t_typical t_avg t_stdev 99% percentile 99.9% percentile
// 8388608 5000 581.27 876.26 594.87 595.50 3.33 601.65 621.14
// parsed result:
// 195.76 (BW average)
// 595.50 (t_avg)
float process_raw_output(string output) {
float res = -1.0;
try {
string pattern;
vector<string> lines;
boost::split(lines, output, boost::is_any_of("\n"), boost::token_compress_on);
regex re("\\d+\\s+\\d+\\s+\\d+\\.\\d+\\s+(\\d+\\.\\d+)\\s+\\d+\\.\\d+");
if (output.find("BW") != string::npos) {
pattern = "\\d+\\s+\\d+\\s+\\d+\\.\\d+\\s+(\\d+\\.\\d+)\\s+\\d+\\.\\d+";
} else {
pattern = "\\d+\\s+\\d+\\s+\\d+\\.\\d+\\s+\\d+\\.\\d+\\s+\\d+\\.\\d+"
"\\s+(\\d+\\.\\d+)\\s+\\d+\\.\\d+\\s+\\d+\\.\\d+\\s+\\d+\\.\\d+";
}
regex re(pattern);
for (string line : lines) {
smatch m;
if (regex_search(line, m, re))
......