Unverified commit 63e9b2d1, authored by Yifan Xiong, committed by GitHub

Release - SuperBench v0.6.0 (#409)



**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
parent 733860d7
@@ -15,7 +15,7 @@
 __SuperBench__ is a validation and profiling tool for AI infrastructure.
-📢 [v0.5.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.5.0) has been released!
+📢 [v0.6.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.6.0) has been released!
 ## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._
...
@@ -26,6 +26,7 @@ RUN apt-get update && \
     curl \
     dmidecode \
     git \
+    iproute2 \
     jq \
     libaio-dev \
     libcap2 \
@@ -38,6 +39,7 @@ RUN apt-get update && \
     openssh-client \
     openssh-server \
     pciutils \
+    sudo \
     util-linux \
     vim \
     wget \
...
@@ -31,6 +31,7 @@ RUN apt-get update && \
     dmidecode \
     git \
     hipify-clang \
+    iproute2 \
    jq \
     libaio-dev \
     libboost-program-options-dev \
@@ -46,6 +47,7 @@ RUN apt-get update && \
     openssh-server \
     pciutils \
     rsync \
+    sudo \
     util-linux \
     vim \
     wget \
...
@@ -30,6 +30,7 @@ RUN apt-get update && \
     dmidecode \
     git \
     hipify-clang \
+    iproute2 \
     jq \
     libaio-dev \
     libboost-program-options-dev \
@@ -46,6 +47,7 @@ RUN apt-get update && \
     openssh-server \
     pciutils \
     rsync \
+    sudo \
     util-linux \
     vim \
     wget \
...
@@ -180,16 +180,16 @@ sb result diagnosis --baseline-file
 #### Required arguments
-| Name                   | Description            |
-|------------------------|------------------------|
-| `--baseline-file` `-b` | Path to baseline file. |
-| `--data-file` `-d`     | Path to raw data file. |
-| `--rule-file` `-r`     | Path to rule file.     |
+| Name               | Description            |
+|--------------------|------------------------|
+| `--data-file` `-d` | Path to raw data file. |
+| `--rule-file` `-r` | Path to rule file.     |
 #### Optional arguments
 | Name                    | Default | Description                                                                 |
 |-------------------------|---------|-----------------------------------------------------------------------------|
+| `--baseline-file` `-b`  | N/A     | Path to baseline file.                                                      |
 | `--decimal-place-value` | 2       | Number of valid decimal places to show in output. Default: 2.               |
 | `--output-all`          | N/A     | Output diagnosis results for all nodes.                                     |
 | `--output-dir`          | `None`  | Path to output directory, outputs/{datetime} will be used if not specified. |
...
@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
 :::note Note
 You should checkout corresponding tag to use release version, for example,
-`git clone -b v0.5.0 https://github.com/microsoft/superbenchmark`
+`git clone -b v0.6.0 https://github.com/microsoft/superbenchmark`
 :::
 ```bash
@@ -96,7 +96,7 @@ Here're the system requirements for all managed GPU nodes.
 * Latest version of Linux, you're highly encouraged to use Ubuntu 18.04 or later.
 * Compatible GPU drivers should be installed correctly. Driver version can be checked by running `nvidia-smi`.
-* [Docker CE](https://docs.docker.com/engine/install/) version 19.03 or later (which can be checked by running `docker --version`).
+* [Docker CE](https://docs.docker.com/engine/install/) version 20.10 or later (which can be checked by running `docker --version`).
 * NVIDIA GPU support in Docker, install
   [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit).
@@ -106,7 +106,7 @@ Here're the system requirements for all managed GPU nodes.
 * Latest version of Linux, you're highly encouraged to use Ubuntu 18.04 or later.
 * Compatible GPU drivers should be installed correctly, and group permission should be set to access GPU resources.
   You should be able to run `rocm-smi` and `rocminfo` directly to check GPU usage and information.
-* [Docker CE](https://docs.docker.com/engine/install/) version 19.03 or later (which can be checked by running `docker --version`).
+* [Docker CE](https://docs.docker.com/engine/install/) version 20.10 or later (which can be checked by running `docker --version`).
 </TabItem>
 </Tabs>
@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
 :::note Note
 You should deploy corresponding Docker image to use release version, for example,
-`sb deploy -f local.ini -i superbench/superbench:v0.5.0-cuda11.1.1`
+`sb deploy -f local.ini -i superbench/superbench:v0.6.0-cuda11.1.1`
+Note that the version of the git repo only determines the version of the sb CLI, not the sb container. You should define the container version even if you specified a release version for the git clone.
...
@@ -70,7 +70,7 @@ superbench:
 <TabItem value='example'>
 ```yaml
-version: v0.5
+version: v0.6
 superbench:
   enable: benchmark_1
   monitor:
...
@@ -29,6 +29,7 @@ available tags are listed below for all stable versions.
 | Tag               | Description                        |
 |-------------------|------------------------------------|
+| v0.6.0-cuda11.1.1 | SuperBench v0.6.0 with CUDA 11.1.1 |
 | v0.5.0-cuda11.1.1 | SuperBench v0.5.0 with CUDA 11.1.1 |
 | v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
 | v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
@@ -40,6 +41,10 @@ available tags are listed below for all stable versions.
 | Tag                           | Description                                      |
 |-------------------------------|--------------------------------------------------|
+| v0.6.0-rocm5.1.3              | SuperBench v0.6.0 with ROCm 5.1.3                |
+| v0.6.0-rocm5.1.1              | SuperBench v0.6.0 with ROCm 5.1.1                |
+| v0.6.0-rocm5.0.1              | SuperBench v0.6.0 with ROCm 5.0.1                |
+| v0.6.0-rocm5.0                | SuperBench v0.6.0 with ROCm 5.0                  |
 | v0.5.0-rocm5.0.1-pytorch1.9.0 | SuperBench v0.5.0 with ROCm 5.0.1, PyTorch 1.9.0 |
 | v0.5.0-rocm5.0-pytorch1.9.0   | SuperBench v0.5.0 with ROCm 5.0, PyTorch 1.9.0   |
 | v0.5.0-rocm4.2-pytorch1.7.0   | SuperBench v0.5.0 with ROCm 4.2, PyTorch 1.7.0   |
...
@@ -32,7 +32,7 @@ The input mainly includes 3 files:
 - **rule file**: It uses YAML format and includes each metric's rules to filter defective machines for diagnosis.
-- **baseline file**: json file including the baseline values for the metrics.
+- **baseline file (optional)**: json file including the baseline values for the metrics.
   `Tips`: this file for some representative machine types will be published in [SuperBench Results Repo](https://github.com/microsoft/superbench-results/tree/main) with the release of SuperBench.
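For reference, the baseline file is a flat JSON object mapping metric names to baseline values; the snippet below is only illustrative (values are made up, not published baselines):

```json
{
  "kernel-launch/event_overhead": 0.0055,
  "kernel-launch/wall_overhead": 0.0095,
  "mem-bw/h2d_bw": 26.0
}
```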
@@ -52,8 +52,8 @@ superbench:
   ${var_name}: dict
   rules:
     ${rule_name}:
-      function: string
-      criteria: string
+      function: (optional) string
+      criteria: (optional) string
       store: (optional) bool
       categories: string
       metrics:
@@ -65,11 +65,11 @@ superbench:
 example:
 ```yaml
 # SuperBench rules
-version: v0.5
+version: v0.6
 superbench:
   rules:
     failure-rule:
-      function: value
+      function: failure_check
       criteria: lambda x:x>0
       categories: Failed
       metrics:
@@ -125,8 +125,17 @@ superbench:
         - vgg_models/pytorch-vgg.*/throughput_train_.*
     rule6:
       function: multi_rules
-      criteria: 'lambda label:True if label["rule4"]+label["rule5"]>=2 else False'
+      criteria: 'lambda label: bool(label["rule4"]+label["rule5"]>=2)'
       categories: CNN
+    rule7:
+      categories: MODEL_DIST
+      store: True
+      metrics:
+        - model-benchmarks:stress-run.*/pytorch-gpt2-large/fp32_train_throughput
+    rule8:
+      function: multi_rules
+      criteria: 'lambda label: bool(min(label["rule7"].values()) < 1)'
+      categories: MODEL_DIST
 ```
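Each `criteria` in the example above is a Python lambda supplied as a string. A minimal sketch of how such a string can be evaluated against a computed value (the helper `check_criteria` is hypothetical, not part of SuperBench):

```python
# Hypothetical helper: evaluate a rule's criteria string against one value.
def check_criteria(criteria: str, value: float) -> bool:
    rule_fn = eval(criteria)  # e.g. 'lambda x: x > 0.05' -> a callable
    return bool(rule_fn(value))

# A variance-style check: flag a node whose throughput dropped more than 5%.
baseline, raw = 100.0, 93.0
variance = (raw - baseline) / baseline  # -0.07
print(check_criteria('lambda x: x < -0.05', variance))  # True -> defective
```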
 This rule file describes the rules used for data diagnosis.
@@ -147,15 +156,18 @@ The criterion used for this rule, which indicates how to compare the data with t
 #### `store`
-True if the current rule is not used alone to filter the defective machine, but will be used by other subsequent rules. False (default) if this rule is used to label the defective machine directly.
+- True: this rule is used to store metrics that will be used by other subsequent rules.
+  - If store is True and criteria/function are not None in the rule, the number of metrics in this rule that meet the criteria is stored into label["rule_name"]; for example, label["rule_name"]=2 means 2 metrics are identified as defective by this rule.
+  - If store is True and criteria/function are None, the dict of {metric_name: value} for this rule's metrics is stored into label["rule_name"].
+- False (default): this rule is used to label the defective machine directly.
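The `store` semantics can be sketched as a plain dict consumed by later `multi_rules` criteria (rule names mirror the example rule file; values are made up):

```python
# label dict accumulated from store:True rules (illustrative values).
label = {
    'rule4': 1,   # store + criteria: count of metrics that met the criteria
    'rule5': 1,
    'rule7': {    # store without function/criteria: raw {metric: value} dict
        'model-benchmarks:stress-run:0/pytorch-gpt2-large/fp32_train_throughput': 0.8,
    },
}
# rule6-style criteria: defective if rule4 and rule5 together flagged >= 2 metrics
cnn_defective = bool(label['rule4'] + label['rule5'] >= 2)
# rule8-style criteria: defective if any stored throughput is below 1
dist_defective = min(label['rule7'].values()) < 1
print(cnn_defective, dist_defective)  # True True
```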
 #### `function`
 The function used for this rule.
-3 types of rules are supported currently:
+The supported functions are listed as follows:
-- `variance`: the rule is to check if the variance between raw data and baseline violates the criteria. variance = (raw data - criteria) / criteria
+- `variance`: the rule is to check if the variance between raw data and baseline violates the criteria. variance = (raw data - baseline) / baseline
   For example, if the 'criteria' is `lambda x:x>0.05`, the rule is that if the variance is larger than 5%, it should be defective.
@@ -164,8 +176,16 @@ The function used for this rule.
   For example, if the 'criteria' is `lambda x:x>0`, the rule is that if the raw data is larger than 0, it should be defective.
 - `multi_rules`: the rule is to check if the combined results of multiple previous rules and metrics violate the criteria.
+  Several examples are listed as follows:
+  - `criteria: 'lambda label: bool(label["rule4"]+label["rule5"]>=2)'` means that this rule is triggered if the sum of labeled metrics in rule4 and rule5 is at least 2.
+  - `criteria: 'lambda label: bool(min(label["rule7"].values()) < 1)'` means that if the minimum of the metrics' values stored in rule7 is smaller than 1, the machine should be identified as defective.
+  - If you reference a non-existent rule, an exception will be raised.
+  - If the test in a referenced rule failed or did not run, causing an exception in the criteria, no exception is raised, since such cases are caught by the failure_check rule.
-  For example, if the 'criteria' is 'lambda label:True if label["rule4"]+label["rule5"]>=2 else False', the rule is that if the sum of labeled metrics in rule4 and rule5 is larger than 2, it should be defective.
+- `failure_check`: the rule is to check if any metric in this rule failed or missed the test. The metrics in this rule should look like `{benchmark_name}/.*:return_code`, which is used to identify failures.
+  - If an item never matches any metric in the raw data, the rule identifies it as a missed test.
+  - If any metric violates the `value` criteria, meaning return_code is not success (0), the rule identifies it as a failed test.
 `Tips`: you must include a default rule for ${benchmark_name}/return_code as in the example above, which is used to identify failed tests.
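The failure_check behavior described above roughly amounts to matching `return_code` metrics and flagging missing or nonzero ones; a sketch under those assumptions (the `failure_check` function here is illustrative, not the internal implementation):

```python
import re

def failure_check(node_metrics: dict, pattern: str) -> list:
    """Return issues for return_code metrics matching pattern (illustrative)."""
    issues = []
    matched = {k: v for k, v in node_metrics.items() if re.match(pattern, k)}
    if not matched:
        # the pattern never matched any raw-data metric -> miss test
        issues.append('miss test: no metric matches {}'.format(pattern))
    for name, code in matched.items():
        if code != 0:  # return_code 0 means success
            issues.append('failed test: {}={}'.format(name, code))
    return issues

print(failure_check({'kernel-launch/return_code': 1}, r'kernel-launch/.*return_code'))
```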
@@ -182,6 +202,8 @@ The output includes all defective machines' information including index, failure
 - Defective Details (diagnosis/issue_details in json format): all violated metrics including metric data and related rule.
   - ${metric}: the data of the metrics defined in the rule file. If the rule is `variance`, the data takes the form of variance in percentage; if the rule is `value`, the data takes the form of raw data.
+- `'N/A'` indicates an empty value for the metric in output files.
 If you specify '--output-all' in the command, the output includes all machines' information and an extra field to indicate whether each machine is defective.
...
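The `'N/A'` normalization corresponds to the common pandas idiom of converting to nullable dtypes and filling missing cells, roughly as sketched below (the column name is illustrative):

```python
import pandas as pd

# A metric column with a missing value for one node.
df = pd.DataFrame({'kernel-launch/event_overhead': [0.0055, float('nan')]})
# Convert to nullable dtypes, drop to object, then unify empty values to 'N/A'.
df = df.convert_dtypes().astype('object').fillna('N/A')
print(df['kernel-launch/event_overhead'].iloc[1])  # N/A
```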
@@ -58,7 +58,7 @@ superbench:
 ```yaml title="Example"
 # SuperBench rules
-version: v0.5
+version: v0.6
 superbench:
   rules:
     kernel_launch:
...
@@ -142,7 +142,7 @@ def run(self):
     install_requires=[
         'ansible_base>=2.10.9;os_name=="posix"',
         'ansible_runner>=2.0.0rc1',
-        'colorlog>=4.7.2',
+        'colorlog>=6.7.0',
         'jinja2>=2.10.1',
         'joblib>=1.0.1',
         'jsonlines>=2.0.0',
@@ -155,6 +155,7 @@ def run(self):
         'omegaconf==2.0.6',
         'openpyxl>=3.0.7',
         'pandas>=1.1.5',
+        'pssh @ git+https://github.com/lilydjwg/pssh.git@v2.3.4',
         'pyyaml>=5.3',
         'requests>=2.27.1',
         'seaborn>=0.11.2',
@@ -169,8 +170,8 @@ def run(self):
             **x,
             'develop': x['dev'] + x['test'],
             'cpuworker': x['torch'],
-            'amdworker': x['torch'] + x['ort'] + x['mpi'],
-            'nvworker': x['torch'] + x['ort'] + x['mpi'] + x['nvidia'],
+            'amdworker': x['torch'] + x['ort'],
+            'nvworker': x['torch'] + x['ort'] + x['nvidia'],
         }
     )(
         {
@@ -199,7 +200,6 @@ def run(self):
             'onnx>=1.10.2',
             'onnxruntime-gpu==1.10.0',
         ],
-        'mpi': ['mpi4py>=3.1.3'],
         'nvidia': ['py3nvml>=0.2.6'],
     }
 ),
...
@@ -6,5 +6,5 @@
 Provide hardware and software benchmarks for AI systems.
 """
-__version__ = '0.5.0'
+__version__ = '0.6.0'
 __author__ = 'Microsoft'
@@ -21,6 +21,7 @@ class DataDiagnosis(RuleBase):
     def __init__(self):
         """Init function."""
         super().__init__()
+        self.na = 'N/A'
     def _check_and_format_rules(self, rule, name):
         """Check whether the format of the metric's rule is valid.
@@ -63,8 +64,6 @@ def _get_baseline_of_metric(self, baseline, metric):
         """
         if metric in baseline:
             return baseline[metric]
-        elif 'return_code' in metric:
-            return 0
         else:
             short = metric
             # exclude rank info, for example, '.*:\d+'->'.*'
@@ -76,8 +75,7 @@ def _get_baseline_of_metric(self, baseline, metric):
                 return baseline[short]
             # baseline not defined
             else:
-                logger.warning('DataDiagnosis: get baseline - {} baseline not found'.format(metric))
-                return -1
+                return None
     def __get_metrics_and_baseline(self, rule, benchmark_rules, baseline):
         """Get metrics with baseline in the rule.
@@ -108,8 +106,7 @@ def _parse_rules_and_baseline(self, rules, baseline):
         """
         try:
             if not rules:
-                logger.error('DataDiagnosis: get criteria failed')
-                return False
+                logger.log_and_raise(exception=Exception, msg='DataDiagnosis: get criteria failed')
             self._sb_rules = {}
             self._enable_metrics = set()
             benchmark_rules = rules['superbench']['rules']
@@ -129,8 +126,7 @@ def _parse_rules_and_baseline(self, rules, baseline):
                 self.__get_metrics_and_baseline(rule, benchmark_rules, baseline)
             self._enable_metrics = sorted(list(self._enable_metrics))
         except Exception as e:
-            logger.error('DataDiagnosis: get criteria failed - {}'.format(str(e)))
-            return False
+            logger.log_and_raise(exception=Exception, msg='DataDiagnosis: get criteria failed - {}'.format(str(e)))
         return True
@@ -205,32 +201,29 @@ def run_diagnosis_rules(self, rules, baseline):
             data_not_accept_df (DataFrame): defective nodes' detailed information
             label_df (DataFrame): labels for all nodes
         """
-        try:
         summary_columns = ['Category', 'Defective Details']
         data_not_accept_df = pd.DataFrame(columns=summary_columns)
         summary_details_df = pd.DataFrame()
         label_df = pd.DataFrame(columns=['label'])
         if not self._parse_rules_and_baseline(rules, baseline):
             return data_not_accept_df, label_df
         # run diagnosis rules for each node
         for node in self._raw_data_df.index:
             details_row, summary_data_row = self._run_diagnosis_rules_for_single_node(node)
             if details_row:
                 data_not_accept_df.loc[node] = details_row
                 summary_details_df = pd.concat(
                     [summary_details_df,
                      pd.DataFrame([summary_data_row.to_dict()], index=[summary_data_row.name])]
                 )
                 label_df.loc[node] = 1
             else:
                 label_df.loc[node] = 0
         # combine details for defective nodes
         if len(data_not_accept_df) != 0:
             data_not_accept_df = data_not_accept_df.join(summary_details_df)
             data_not_accept_df = data_not_accept_df.sort_values(by=summary_columns, ascending=False)
-        except Exception as e:
-            logger.error('DataDiagnosis: run diagnosis rules failed, message: {}'.format(str(e)))
         return data_not_accept_df, label_df
     def output_all_nodes_results(self, raw_data_df, data_not_accept_df):
@@ -258,24 +251,21 @@ def output_all_nodes_results(self, raw_data_df, data_not_accept_df):
         data_not_accept_df['Number Of Issues'] = data_not_accept_df['Defective Details'].map(
             lambda x: len(x.split(','))
         )
-        for index in range(len(append_columns)):
+        for index in range(len(append_columns) - 1, -1, -1):
             if append_columns[index] not in data_not_accept_df:
-                logger.warning(
-                    'DataDiagnosis: output_all_nodes_results - column {} not found in data_not_accept_df.'.format(
-                        append_columns[index]
-                    )
-                )
-                all_data_df[append_columns[index]] = None
+                logger.log_and_raise(
+                    Exception,
+                    msg='DataDiagnosis: output_all_nodes_results - column {} not found in data_not_accept_df.'.
+                    format(append_columns[index])
+                )
             else:
-                all_data_df = all_data_df.merge(
-                    data_not_accept_df[[append_columns[index]]], left_index=True, right_index=True, how='left'
-                )
+                all_data_df = data_not_accept_df[[
+                    append_columns[index]
+                ]].merge(all_data_df, left_index=True, right_index=True, how='right')
         all_data_df['Accept'] = all_data_df['Accept'].replace(np.nan, True)
         all_data_df['Number Of Issues'] = all_data_df['Number Of Issues'].replace(np.nan, 0)
         all_data_df['Number Of Issues'] = all_data_df['Number Of Issues'].astype(int)
-        all_data_df = all_data_df.replace(np.nan, '')
         return all_data_df
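The reworked merge above relies on standard pandas index-merge behavior: iterating the appended columns in reverse and merging with `how='right'` keeps them in front of the raw-data columns while preserving all nodes. A small sketch (frame contents are illustrative):

```python
import pandas as pd

# Raw data for all nodes, and defect columns for the flagged node only.
all_data_df = pd.DataFrame({'metric': [1.0, 2.0]}, index=['node1', 'node2'])
defects = pd.DataFrame({'Accept': [False]}, index=['node2'])

# Merge defect columns from the left so they end up before the raw metrics,
# keeping every row of the right-hand frame (all nodes).
merged = defects.merge(all_data_df, left_index=True, right_index=True, how='right')
print(list(merged.columns))  # ['Accept', 'metric']
```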
     def output_diagnosis_in_excel(self, raw_data_df, data_not_accept_df, output_path, rules):
@@ -288,16 +278,16 @@ def output_diagnosis_in_excel(self, raw_data_df, data_not_accept_df, output_path
             rules (dict): the rules of DataDiagnosis
         """
         try:
+            data_not_accept_df = data_not_accept_df.convert_dtypes()
             writer = pd.ExcelWriter(output_path, engine='xlsxwriter')
             # Check whether writer is valid
             if not isinstance(writer, pd.ExcelWriter):
-                logger.error('DataDiagnosis: excel_data_output - invalid file path.')
-                return
+                logger.log_and_raise(exception=IOError, msg='DataDiagnosis: excel_data_output - invalid file path.')
             file_handler.output_excel_raw_data(writer, raw_data_df, 'Raw Data')
             file_handler.output_excel_data_not_accept(writer, data_not_accept_df, rules)
             writer.save()
         except Exception as e:
-            logger.error('DataDiagnosis: excel_data_output - {}'.format(str(e)))
+            logger.log_and_raise(exception=Exception, msg='DataDiagnosis: excel_data_output - {}'.format(str(e)))
     def output_diagnosis_in_jsonl(self, data_not_accept_df, output_path):
         """Output data_not_accept_df into jsonl file.
@@ -306,24 +296,29 @@ def output_diagnosis_in_jsonl(self, data_not_accept_df, output_path):
             data_not_accept_df (DataFrame): the DataFrame to output
             output_path (str): the path of output jsonl file
         """
+        data_not_accept_df = data_not_accept_df.convert_dtypes().astype('object').fillna(self.na)
         p = Path(output_path)
         try:
             data_not_accept_json = data_not_accept_df.to_json(orient='index')
             data_not_accept = json.loads(data_not_accept_json)
             if not isinstance(data_not_accept_df, pd.DataFrame):
-                logger.warning('DataDiagnosis: output json data - data_not_accept_df is not DataFrame.')
-                return
+                logger.log_and_raise(
+                    Exception, msg='DataDiagnosis: output json data - data_not_accept_df is not DataFrame.'
+                )
             if data_not_accept_df.empty:
-                logger.warning('DataDiagnosis: output json data - data_not_accept_df is empty.')
+                with p.open('w') as f:
+                    pass
                 return
             with p.open('w') as f:
                 for node in data_not_accept:
                     line = data_not_accept[node]
-                    line['Index'] = node
+                    line['index'] = node
                     json_str = json.dumps(line)
                     f.write(json_str + '\n')
         except Exception as e:
-            logger.error('DataDiagnosis: output json data failed, msg: {}'.format(str(e)))
+            logger.log_and_raise(
+                exception=Exception, msg='DataDiagnosis: output json data failed, msg: {}'.format(str(e))
+            )
     def output_diagnosis_in_json(self, data_not_accept_df, output_path):
         """Output data_not_accept_df into json file.
@@ -332,7 +327,8 @@ def output_diagnosis_in_json(self, data_not_accept_df, output_path):
             data_not_accept_df (DataFrame): the DataFrame to output
             output_path (str): the path of output json file
         """
-        data_not_accept_df['Index'] = data_not_accept_df.index
+        data_not_accept_df = data_not_accept_df.convert_dtypes().astype('object').fillna(self.na)
+        data_not_accept_df = data_not_accept_df.reset_index()
         data_not_accept_df = data_not_accept_df.rename(
             columns={
                 'Defective Details': 'diagnosis/issue_details',
@@ -358,29 +354,31 @@ def generate_md_lines(self, data_not_accept_df, rules, round):
         Returns:
             list: lines in markdown format
         """
-        data_not_accept_df['machine'] = data_not_accept_df.index
+        if len(data_not_accept_df) == 0:
+            return []
+        data_not_accept_df = data_not_accept_df.reset_index()
         header = data_not_accept_df.columns.tolist()
-        header = header[-1:] + header[:-1]
-        data_not_accept_df = data_not_accept_df[header]
         # format precision of values to n decimal digits
         for rule in rules:
-            for metric in rules[rule]['metrics']:
-                if rules[rule]['function'] == 'variance':
-                    if round and isinstance(round, int):
-                        data_not_accept_df[metric] = data_not_accept_df[metric].map(
-                            lambda x: x * 100, na_action='ignore'
-                        )
-                        data_not_accept_df = data_analysis.round_significant_decimal_places(
-                            data_not_accept_df, round, [metric]
-                        )
-                        data_not_accept_df[metric] = data_not_accept_df[metric].map(
-                            lambda x: '{}%'.format(x), na_action='ignore'
-                        )
-                elif rules[rule]['function'] == 'value':
-                    if round and isinstance(round, int):
-                        data_not_accept_df = data_analysis.round_significant_decimal_places(
-                            data_not_accept_df, round, [metric]
-                        )
+            if 'function' in rules[rule]:
+                for metric in rules[rule]['metrics']:
+                    if rules[rule]['function'] == 'variance':
+                        if round and isinstance(round, int):
+                            data_not_accept_df[metric] = data_not_accept_df[metric].map(
+                                lambda x: x * 100, na_action='ignore'
+                            )
+                            data_not_accept_df = data_analysis.round_significant_decimal_places(
+                                data_not_accept_df, round, [metric]
+                            )
+                            data_not_accept_df[metric] = data_not_accept_df[metric].map(
+                                lambda x: '{}%'.format(x), na_action='ignore'
+                            )
+                    elif rules[rule]['function'] == 'value':
+                        if round and isinstance(round, int):
+                            data_not_accept_df = data_analysis.round_significant_decimal_places(
+                                data_not_accept_df, round, [metric]
+                            )
+        data_not_accept_df = data_not_accept_df.convert_dtypes().astype('object').fillna(self.na)
         lines = file_handler.generate_md_table(data_not_accept_df, header)
         return lines
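The variance branch above scales the ratio by 100, rounds it, and appends a percent sign; `na_action='ignore'` leaves missing cells untouched at every step. A small standalone sketch, using plain `round` as a stand-in for `data_analysis.round_significant_decimal_places`:

```python
import pandas as pd

df = pd.DataFrame({'bandwidth_variance': [0.05123, None, -0.0217]})

# Scale ratio -> percent; na_action='ignore' skips NaN instead of mapping it.
df['bandwidth_variance'] = df['bandwidth_variance'].map(lambda x: x * 100, na_action='ignore')
# Round (stand-in for round_significant_decimal_places) and append a percent sign.
df['bandwidth_variance'] = df['bandwidth_variance'].map(
    lambda x: '{}%'.format(round(x, 2)), na_action='ignore'
)

print(df['bandwidth_variance'].tolist())
```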
@@ -401,7 +399,7 @@ def run(
         try:
             rules = self._preprocess(raw_data_file, rule_file)
             # read baseline
-            baseline = file_handler.read_baseline(baseline_file)
+            baseline = file_handler.read_baseline(baseline_file) if baseline_file is not None else {}
             logger.info('DataDiagnosis: Begin to process {} nodes'.format(len(self._raw_data_df)))
             output_df, label_df = self.run_diagnosis_rules(rules, baseline)
             logger.info('DataDiagnosis: Processed finished')
...@@ -424,7 +422,9 @@ def run( ...@@ -424,7 +422,9 @@ def run(
else: else:
file_handler.output_lines_in_html(lines, output_path) file_handler.output_lines_in_html(lines, output_path)
else: else:
logger.error('DataDiagnosis: output failed - unsupported output format') logger.log_and_raise(
exception=Exception, msg='DataDiagnosis: output failed - unsupported output format'
)
logger.info('DataDiagnosis: Output results to {}'.format(output_path)) logger.info('DataDiagnosis: Output results to {}'.format(output_path))
except Exception as e: except Exception as e:
logger.error('DataDiagnosis: run failed - {}'.format(str(e))) logger.log_and_raise(exception=Exception, msg='DataDiagnosis: run failed - {}'.format(str(e)))
@@ -66,7 +66,7 @@ def check_criterion_with_a_value(rule):
         """
         # parse criteria and check if valid
         if not isinstance(eval(rule['criteria'])(0), bool):
-            logger.log_and_raise(exception=Exception, msg='invalid criteria format')
+            logger.log_and_raise(exception=ValueError, msg='invalid criteria format')

     @staticmethod
     def miss_test(metric, rule, data_row, details, categories):
@@ -130,8 +130,10 @@ def variance(data_row, rule, summary_data_row, details, categories):
         # check if metric pass the rule
         val = data_row[metric]
         baseline = rule['metrics'][metric]
-        if baseline == 0:
-            logger.log_and_raise(exception=Exception, msg='invalid baseline 0 in variance rule')
+        if baseline is None or baseline == 0:
+            logger.log_and_raise(
+                exception=ValueError, msg='invalid baseline 0 or baseline not found in variance rule'
+            )
         var = (val - baseline) / baseline
         summary_data_row[metric] = var
         violate_metric = eval(rule['criteria'])(var)
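The variance rule above normalizes a measurement against its baseline and feeds the ratio to a criteria lambda from the rule file; the new guard also rejects a missing (`None`) baseline, which matters now that the baseline file is optional (#399). A standalone sketch of the check:

```python
def check_variance(val, baseline, criteria='lambda x: x < -0.05'):
    """Return (variance, violated); criteria is an eval'ed lambda string, as in rule files."""
    if baseline is None or baseline == 0:
        raise ValueError('invalid baseline 0 or baseline not found in variance rule')
    var = (val - baseline) / baseline
    return var, eval(criteria)(var)

# 180 GB/s against a 200 GB/s baseline is a -10% variance -> violates "< -5%".
print(check_variance(180.0, 200.0))
```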
@@ -203,13 +205,20 @@ def multi_rules(rule, details, categories, store_values):
         Returns:
             number: 0 if the rule is passed, otherwise 1
         """
-        violated = eval(rule['criteria'])(store_values)
-        if not isinstance(violated, bool):
-            logger.log_and_raise(exception=Exception, msg='invalid upper criteria format')
-        if violated:
-            info = '{}:{}'.format(rule['name'], rule['criteria'])
-            RuleOp.add_categories_and_details(info, rule['categories'], details, categories)
-        return 1 if violated else 0
+        try:
+            violated = eval(rule['criteria'])(store_values)
+            if not isinstance(violated, bool):
+                logger.log_and_raise(exception=ValueError, msg='invalid criteria format')
+            if violated:
+                info = '{}:{}'.format(rule['name'], rule['criteria'])
+                RuleOp.add_categories_and_details(info, rule['categories'], details, categories)
+            return 1 if violated else 0
+        # the key defined in criteria is not found
+        except KeyError as e:
+            logger.log_and_raise(exception=KeyError, msg='invalid criteria format - {}'.format(str(e)))
+        # miss/failed test
+        except Exception:
+            return 0

     @staticmethod
     def failure_check(data_row, rule, summary_data_row, details, categories, raw_rule):
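`multi_rules` above evaluates a criteria lambda over values stored by earlier rules; the new `try/except` turns a missing key into an explicit `KeyError` while treating any other failure as a missed test. A hedged sketch of that control flow, with the logging stripped out:

```python
def eval_multi_rule(criteria, store_values):
    """Evaluate a cross-rule criteria string; return 1 if violated, 0 otherwise."""
    try:
        violated = eval(criteria)(store_values)
        if not isinstance(violated, bool):
            raise ValueError('invalid criteria format')
        return 1 if violated else 0
    # the key defined in criteria is not found
    except KeyError as e:
        raise KeyError('invalid criteria format - {}'.format(str(e)))
    # miss/failed test
    except Exception:
        return 0

# Hypothetical stored values keyed by rule name, then metric.
values = {'rule1': {'cpu_latency': 1.2}, 'rule2': {'mem_latency': 0.8}}
print(eval_multi_rule("lambda x: x['rule1']['cpu_latency'] > x['rule2']['mem_latency']", values))
```

Note the ordering: the `KeyError` branch re-raises, while every other exception is swallowed as "test missed", mirroring the diff.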
...
@@ -28,8 +28,9 @@ def read_raw_data(raw_data_path):
     p = Path(raw_data_path)
     raw_data_df = pd.DataFrame()
     if not p.is_file():
-        logger.error('FileHandler: invalid raw data path - {}'.format(raw_data_path))
-        return raw_data_df
+        logger.log_and_raise(
+            exception=FileNotFoundError, msg='FileHandler: invalid raw data path - {}'.format(raw_data_path)
+        )

     try:
         with p.open(encoding='utf-8') as f:
@@ -38,7 +39,7 @@ def read_raw_data(raw_data_path):
             raw_data_df = raw_data_df.rename(raw_data_df['node'])
             raw_data_df = raw_data_df.drop(columns=['node'])
     except Exception as e:
-        logger.error('Analyzer: invalid raw data fomat - {}'.format(str(e)))
+        logger.log_and_raise(exception=IOError, msg='Analyzer: invalid raw data fomat - {}'.format(str(e)))
     return raw_data_df
@@ -54,8 +55,9 @@ def read_rules(rule_file=None):
     default_rule_file = Path(__file__).parent / 'rule/default_rule.yaml'
     p = Path(rule_file) if rule_file else default_rule_file
     if not p.is_file():
-        logger.error('FileHandler: invalid rule file path - {}'.format(str(p.resolve())))
-        return None
+        logger.log_and_raise(
+            exception=FileNotFoundError, msg='FileHandler: invalid rule file path - {}'.format(str(p.resolve()))
+        )
     baseline = None
     with p.open() as f:
         baseline = yaml.load(f, Loader=yaml.SafeLoader)
@@ -73,8 +75,9 @@ def read_baseline(baseline_file):
     """
     p = Path(baseline_file)
     if not p.is_file():
-        logger.error('FileHandler: invalid baseline file path - {}'.format(str(p.resolve())))
-        return None
+        logger.log_and_raise(
+            exception=FileNotFoundError, msg='FileHandler: invalid baseline file path - {}'.format(str(p.resolve()))
+        )
     baseline = None
     with p.open() as f:
         baseline = json.load(f)
@@ -119,45 +122,46 @@ def output_excel_data_not_accept(writer, data_not_accept_df, rules):
             worksheet = writer.sheets['Not Accept']
             for rule in rules:
-                for metric in rules[rule]['metrics']:
-                    # The column index of the metrics should start from 1
-                    col_index = columns.index(metric) + 1
-                    # Apply percent format for the columns whose rules are variance type.
-                    if rules[rule]['function'] == 'variance':
-                        worksheet.conditional_format(
-                            row_start,
-                            col_index,
-                            row_end,
-                            col_index,    # start_row, start_col, end_row, end_col
-                            {
-                                'type': 'no_blanks',
-                                'format': percent_format
-                            }
-                        )
-                    # Apply red format if the value violates the rule.
-                    if rules[rule]['function'] == 'value' or rules[rule]['function'] == 'variance':
-                        match = re.search(r'(>|<|<=|>=|==|!=)(.+)', rules[rule]['criteria'])
-                        if not match:
-                            continue
-                        symbol = match.group(1)
-                        condition = float(match.group(2))
-                        worksheet.conditional_format(
-                            row_start,
-                            col_index,
-                            row_end,
-                            col_index,    # start_row, start_col, end_row, end_col
-                            {
-                                'type': 'cell',
-                                'criteria': symbol,
-                                'value': condition,
-                                'format': color_format_red
-                            }
-                        )
+                if 'function' in rules[rule]:
+                    for metric in rules[rule]['metrics']:
+                        # The column index of the metrics should start from 1
+                        col_index = columns.index(metric) + 1
+                        # Apply percent format for the columns whose rules are variance type.
+                        if rules[rule]['function'] == 'variance':
+                            worksheet.conditional_format(
+                                row_start,
+                                col_index,
+                                row_end,
+                                col_index,    # start_row, start_col, end_row, end_col
+                                {
+                                    'type': 'no_blanks',
+                                    'format': percent_format
+                                }
+                            )
+                        # Apply red format if the value violates the rule.
+                        if rules[rule]['function'] == 'value' or rules[rule]['function'] == 'variance':
+                            match = re.search(r'(>|<|<=|>=|==|!=)(.+)', rules[rule]['criteria'])
+                            if not match:
+                                continue
+                            symbol = match.group(1)
+                            condition = float(match.group(2))
+                            worksheet.conditional_format(
+                                row_start,
+                                col_index,
+                                row_end,
+                                col_index,    # start_row, start_col, end_row, end_col
+                                {
+                                    'type': 'cell',
+                                    'criteria': symbol,
+                                    'value': condition,
+                                    'format': color_format_red
+                                }
+                            )
         else:
             logger.warning('FileHandler: excel_data_output - data_not_accept_df is empty.')
     else:
-        logger.warning('FileHandler: excel_data_output - data_not_accept_df is not DataFrame.')
+        logger.log_and_raise(RuntimeError, msg='FileHandler: excel_data_output - data_not_accept_df is not DataFrame.')


 def generate_md_table(data_df, header):
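Before applying the red highlight, the code above splits a criteria string such as `'>0.05'` into a comparison operator and a numeric threshold for xlsxwriter's `conditional_format`. The parsing step on its own:

```python
import re

def parse_criteria(criteria):
    """Split a criteria string like '>0.05' into (operator, threshold), or None."""
    match = re.search(r'(>|<|<=|>=|==|!=)(.+)', criteria)
    if not match:
        return None
    return match.group(1), float(match.group(2))

print(parse_criteria('>0.05'))
print(parse_criteria('<-0.05'))
```

Note the alternation order: `>` and `<` are tried before `>=` and `<=`, so single-character operators (the ones SuperBench rule files actually use) always win the match.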
@@ -198,12 +202,11 @@ def output_lines_in_md(lines, output_path):
     """
     try:
         if len(lines) == 0:
-            logger.error('FileHandler: md_data_output failed')
-            return
+            logger.warning('FileHandler: md_data_output is empty')
         with open(output_path, 'w') as f:
             f.writelines(lines)
     except Exception as e:
-        logger.error('FileHandler: md_data_output - {}'.format(str(e)))
+        logger.log_and_raise(exception=IOError, msg='FileHandler: md_data_output - {}'.format(str(e)))
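Many hunks in this PR replace `logger.error(...)` with `logger.log_and_raise(...)` so that failures propagate as exceptions and surface in `sb result diagnosis` exit codes (#403) instead of being logged and swallowed. SuperBench's real helper lives in its own logger module; this is only an illustrative stand-in for the pattern:

```python
import logging

logger = logging.getLogger('sb-sketch')

def log_and_raise(exception=Exception, msg=''):
    """Log the message at error level, then raise it as the given exception type."""
    logger.error(msg)
    raise exception(msg)

try:
    log_and_raise(exception=FileNotFoundError, msg='invalid baseline file path - /tmp/missing.json')
except FileNotFoundError as e:
    print('caught:', e)
```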
 def output_lines_in_html(lines, output_path):
@@ -215,14 +218,13 @@ def output_lines_in_html(lines, output_path):
     """
     try:
         if len(lines) == 0:
-            logger.error('FileHandler: html_data_output failed')
-            return
+            logger.warning('FileHandler: html_data_output is empty')
         lines = ''.join(lines)
         html_str = markdown.markdown(lines, extensions=['markdown.extensions.tables'])
         with open(output_path, 'w') as f:
             f.writelines(html_str)
     except Exception as e:
-        logger.error('FileHandler: html_data_output - {}'.format(str(e)))
+        logger.log_and_raise(exception=IOError, msg='FileHandler: html_data_output - {}'.format(str(e)))


 def merge_column_in_excel(ws, row, column):
...
@@ -103,8 +103,7 @@ def _preprocess(self, raw_data_file, rule_file):
         self._benchmark_metrics_dict = self._get_metrics_by_benchmarks(list(self._raw_data_df.columns))
         # check raw data whether empty
         if len(self._raw_data_df) == 0:
-            logger.error('RuleBase: empty raw data')
-            return None
+            logger.log_and_raise(exception=Exception, msg='RuleBase: empty raw data')
         # read rules
         rules = file_handler.read_rules(rule_file)
         return rules
@@ -3,6 +3,7 @@
 """Module of the base class."""

+import shlex
 import signal
 import traceback
 import argparse
@@ -39,7 +40,7 @@ def __init__(self, name, parameters=''):
             parameters (str): benchmark parameters.
         """
         self._name = name
-        self._argv = list(filter(None, parameters.split(' '))) if parameters is not None else list()
+        self._argv = list(filter(None, shlex.split(parameters))) if parameters is not None else list()
         self._benchmark_type = None
         self._parser = argparse.ArgumentParser(
             add_help=False,
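Switching from `parameters.split(' ')` to `shlex.split(parameters)` is what allows parameter values to contain spaces (#397): quoted arguments stay in one token. A quick comparison with a hypothetical parameter string:

```python
import shlex

parameters = "--ib_dev mlx5_0 --command 'ib_write_bw -a' --iters 2000"

naive = list(filter(None, parameters.split(' ')))
shell_like = list(filter(None, shlex.split(parameters)))

print(naive)       # the quoted value is broken into two tokens, quotes kept
print(shell_like)  # the quoted value survives as a single token, quotes stripped
```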
@@ -170,10 +171,11 @@ def run(self):
         except BaseException as e:
             self._result.set_return_code(ReturnCode.RUNTIME_EXCEPTION_ERROR)
             logger.error('Run benchmark failed - benchmark: {}, message: {}'.format(self._name, str(e)))
+        else:
+            ret &= self._postprocess()
         finally:
             self._end_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
             self._result.set_timestamp(self._start_time, self._end_time)
-            ret &= self._postprocess()

         return ret
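Moving `ret &= self._postprocess()` from `finally` into a new `else` block means post-processing runs only when the benchmark body raised no exception, while timestamps are still always recorded — part of the timeout-cleanup fix to avoid hanging after a failure (#405). The `try/except/else/finally` control flow in isolation:

```python
def run(body):
    """Mimic Benchmark.run(): postprocess only on success, timestamp always."""
    events, ret = [], True
    try:
        body()
    except BaseException:
        ret = False
        events.append('error-logged')
    else:
        events.append('postprocess')   # runs only if body() did not raise
    finally:
        events.append('timestamp')     # runs in every case
    return ret, events

print(run(lambda: None))
print(run(lambda: 1 / 0))
```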
...
@@ -254,7 +254,7 @@ def __prepare_config(self):
         if not self._args.hostfile:
             self._args.hostfile = os.path.join(os.environ.get('SB_WORKSPACE', '.'), 'hostfile')
         with open(self._args.hostfile, 'r') as f:
-            hosts = f.readlines()
+            hosts = f.read().splitlines()
         # Generate the config file if not define
         if self._args.config is None:
             self.gen_traffic_pattern(hosts, self._args.pattern, self.__config_path)
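`f.readlines()` keeps the trailing `'\n'` on every host name, which would then leak into the generated traffic configs; `f.read().splitlines()` strips the line endings. A quick comparison using an in-memory file in place of a real hostfile:

```python
import io

content = 'node0\nnode1\nnode2\n'

hosts_raw = io.StringIO(content).readlines()
hosts_clean = io.StringIO(content).read().splitlines()

print(hosts_raw)    # ['node0\n', 'node1\n', 'node2\n']
print(hosts_clean)  # ['node0', 'node1', 'node2']
```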
@@ -297,15 +297,18 @@ def __prepare_general_ib_command_params(self):
         # Add GPUDirect for ib command
         gpu_dev = ''
         if self._args.gpu_dev is not None:
-            gpu = GPU()
-            if gpu.vendor == 'nvidia':
-                gpu_dev = f'--use_cuda={self._args.gpu_dev}'
-            elif gpu.vendor == 'amd':
-                gpu_dev = f'--use_rocm={self._args.gpu_dev}'
-            else:
-                self._result.set_return_code(ReturnCode.INVALID_ARGUMENT)
-                logger.error('No GPU found - benchmark: {}'.format(self._name))
-                return False
+            if 'bw' in self._args.command:
+                gpu = GPU()
+                if gpu.vendor == 'nvidia':
+                    gpu_dev = f'--use_cuda={self._args.gpu_dev}'
+                elif gpu.vendor == 'amd':
+                    gpu_dev = f'--use_rocm={self._args.gpu_dev}'
+                else:
+                    self._result.set_return_code(ReturnCode.INVALID_ARGUMENT)
+                    logger.error('No GPU found - benchmark: {}'.format(self._name))
+                    return False
+            elif 'lat' in self._args.command:
+                logger.warning('Wrong configuration: Perftest supports CUDA/ROCM only in BW tests')
         # Generate ib command params
         command_params = f'-F -n {self._args.iters} -d {self._args.ib_dev} {msg_size} {gpu_dev}'
         command_params = f'{command_params.strip()} --report_gbits'
...
@@ -260,13 +260,28 @@ void gather_hostnames(vector<string> &hostnames, string filename) {
 }

 // Parse raw output of ib command
-// TODO: does not work latency tests
+// Sample of ib bw command raw output
+// #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
+// 8388608    5000           196.08             195.76               0.002917
+// Sample of ib latency command raw output
+// #bytes #iterations    t_min    t_max    t_typical    t_avg    t_stdev    99% percentile    99.9% percentile
+// 8388608 5000          581.27   876.26   594.87       595.50   3.33       601.65            621.14
+// parsed result:
+// 195.76 (BW average)
+// 595.50 (t_avg)
 float process_raw_output(string output) {
     float res = -1.0;
     try {
+        string pattern;
         vector<string> lines;
         boost::split(lines, output, boost::is_any_of("\n"), boost::token_compress_on);
-        regex re("\\d+\\s+\\d+\\s+\\d+\\.\\d+\\s+(\\d+\\.\\d+)\\s+\\d+\\.\\d+");
+        if (output.find("BW") != string::npos) {
+            pattern = "\\d+\\s+\\d+\\s+\\d+\\.\\d+\\s+(\\d+\\.\\d+)\\s+\\d+\\.\\d+";
+        } else {
+            pattern = "\\d+\\s+\\d+\\s+\\d+\\.\\d+\\s+\\d+\\.\\d+\\s+\\d+\\.\\d+"
+                      "\\s+(\\d+\\.\\d+)\\s+\\d+\\.\\d+\\s+\\d+\\.\\d+\\s+\\d+\\.\\d+";
+        }
+        regex re(pattern);
         for (string line : lines) {
             smatch m;
             if (regex_search(line, m, re))
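The C++ change above selects a different regex for latency output, capturing `t_avg` (the fourth float after the two integer columns) instead of `BW average` — this is the latency-test support added in #396. The same two patterns, exercised in Python against the sample lines from the new comments:

```python
import re

bw_line = '8388608    5000           196.08             195.76               0.002917'
lat_line = '8388608    5000    581.27   876.26   594.87   595.50   3.33   601.65   621.14'

# bw: capture the 2nd float (BW average); lat: capture the 4th float (t_avg).
bw_re = re.compile(r'\d+\s+\d+\s+\d+\.\d+\s+(\d+\.\d+)\s+\d+\.\d+')
lat_re = re.compile(
    r'\d+\s+\d+\s+\d+\.\d+\s+\d+\.\d+\s+\d+\.\d+\s+(\d+\.\d+)\s+\d+\.\d+\s+\d+\.\d+\s+\d+\.\d+'
)

print(bw_re.search(bw_line).group(1))    # BW average
print(lat_re.search(lat_line).group(1))  # t_avg
```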
...