---
id: data-diagnosis
---

# Data Diagnosis

## Introduction

This tool automatically filters defective machines out of thousands of benchmarking results according to the rules defined in the **rule file**.

## Usage

1. [Install SuperBench](../getting-started/installation.mdx) on the local machine.

2. Prepare the raw data file, rule file, and baseline file under the current path or somewhere on the local machine.

3. Once SuperBench is installed and the files are ready, you can filter the defective machines automatically using the `sb result diagnosis` command. The detailed command reference can be found in [SuperBench CLI](../cli.md).

  ```bash
  sb result diagnosis --data-file ./results-summary.jsonl --rule-file ./rule.yaml --baseline-file ./baseline.json --output-file-format excel --output-dir ${output-dir}
  ```

4. After the command finishes, you can find the output result file named `diagnosis_summary.xlsx` or `diagnosis_summary.json` under `${output-dir}`.

## Input

The input mainly includes 3 files:

 - **raw data**: a JSONL file containing multiple nodes' results, automatically generated by the SuperBench runner.

    `Tips`: this file can be found at `${output-dir}/results-summary.jsonl` after each successful run.

 - **rule file**: a YAML file defining the rules for each metric, used to filter defective machines during diagnosis.

 - **baseline file (optional)**: a JSON file containing the baseline values for the metrics.

    `Tips`: baseline files for some representative machine types will be published in [SuperBench Results Repo](https://github.com/microsoft/superbench-results/tree/main) with each SuperBench release.
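    For illustration, a baseline file maps metric names to their baseline values. The metric names and values below are hypothetical examples that follow the `${benchmark_name}/metric` naming convention; check a published baseline file for the exact entries of your machine type:

    ```json
    {
      "kernel-launch/event_overhead:0": 0.00596,
      "mem-bw/H2D_Mem_BW:0": 26.22,
      "nccl-bw/allreduce_8589934592_busbw:0": 230.2
    }
    ```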

### rule file

This section describes how to write rules in **rule file**.

The convention is the same as in the [SuperBench Config File](../superbench-config.mdx); please review it first.

Here is an overview of the rule file structure:

scheme:
```yaml
version: string
superbench:
  var:
    ${var_name}: dict
  rules:
    ${rule_name}:
      function: (optional)string
      criteria: (optional)string
      store: (optional)bool
      categories: string
      metrics:
        - ${benchmark_name}/regex
        - ${benchmark_name}/regex
        ...
```

example:
```yaml
# SuperBench rules
version: v0.12
superbench:
  rules:
    failure-rule:
      function: failure_check
      criteria: lambda x:x>0
      categories: Failed
      metrics:
        - kernel-launch/return_code
        - mem-bw/return_code
        - nccl-bw/return_code
        - ib-loopback/return_code
    rule0:
    # Rule 0: If KernelLaunch suffers > 5% downgrade, label it as defective
      function: variance
      criteria: lambda x:x>0.05
      categories: KernelLaunch
      metrics:
        - kernel-launch/e2e_latency_us:\d+
        - kernel-launch/host_dispatch_us:\d+
        - kernel-launch/device_launch_us:\d+
    rule1:
    # Rule 1: If H2D_Mem_BW or D2H_Mem_BW test suffers > 5% downgrade, label it as defective
      function: variance
      criteria: lambda x:x<-0.05
      categories: Mem
      metrics:
        - mem-bw/H2D_Mem_BW:\d+
        - mem-bw/D2H_Mem_BW:\d+
    rule2:
    # Rule 2: If NCCL_BW suffers > 5% downgrade, label it as defective
      function: variance
      criteria: lambda x:x<-0.05
      categories: NCCL
      metrics:
        - nccl-bw/allreduce_8589934592_busbw:0
    rule3:
    # Rule 3: If GPT-2, BERT suffers > 5% downgrade, label it as defective
      function: variance
      criteria: lambda x:x<-0.05
      categories: Model
      metrics:
        - bert_models/pytorch-bert-base/throughput_train_float(32|16)
        - bert_models/pytorch-bert-large/throughput_train_float(32|16)
        - gpt_models/pytorch-gpt-large/throughput_train_float(32|16)
    rule4:
      function: variance
      criteria: "lambda x:x<-0.05"
      store: True
      categories: CNN
      metrics:
        - resnet_models/pytorch-resnet.*/throughput_train_.*
    rule5:
      function: variance
      criteria: "lambda x:x<-0.05"
      store: True
      categories: CNN
      metrics:
        - vgg_models/pytorch-vgg.*/throughput_train_.*
    rule6:
      function: multi_rules
      criteria: 'lambda label: bool(label["rule4"]+label["rule5"]>=2)'
      categories: CNN
    rule7:
      categories: MODEL_DIST
      store: True
      metrics:
        - model-benchmarks:stress-run.*/pytorch-gpt2-large/fp32_train_throughput
    rule8:
      function: multi_rules
      criteria: 'lambda label: bool(min(label["rule7"].values())<1)'
      categories: MODEL_DIST
```

This rule file describes the rules used for data diagnosis.

Rules are organized by rule name, and each rule mainly includes the following elements:

#### `metrics`

The list of metrics for this rule. Each metric is in the format of `${benchmark_name}/regex`; you can use a regex after the first '/', but note that the benchmark name itself cannot be a regex.
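As an illustrative sketch (the metric names are hypothetical examples, not SuperBench's internal matching code), applying a rule's metric pattern to raw-data metric names could look like:

```python
import re

# A metric entry from a rule file: literal benchmark name + '/' + regex.
pattern = r"kernel-launch/e2e_latency_us:\d+"

# Metric names as they might appear in results-summary.jsonl.
metrics = [
    "kernel-launch/e2e_latency_us:0",
    "kernel-launch/e2e_latency_us:1",
    "mem-bw/H2D_Mem_BW:0",
]

# Only the kernel-launch latency metrics match the pattern.
matched = [m for m in metrics if re.fullmatch(pattern, m)]
print(matched)
```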

#### `categories`

The category that this rule belongs to.

#### `criteria`

The criterion used for this rule, which indicates how to compare the data with the baseline value for each metric. It should be a lambda function supported by Python.
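Conceptually (a simplified sketch, not SuperBench's actual implementation), the criteria string from the rule file can be turned into a callable and applied to a computed value:

```python
# Simplified sketch: the rule file stores the lambda as a string.
criteria = "lambda x: x > 0.05"
check = eval(criteria)  # turn the string into a callable

variance = 0.08  # e.g., an 8% latency increase over baseline
print(check(variance))  # True means the metric violates the rule
```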

#### `store`

- True: this rule is used to store metrics that will be used by subsequent rules.
  - If `store` is True and criteria/function are not None, the rule stores how many of its metrics meet the criteria into label["rule_name"]; for example, label["rule_name"]=2 means 2 metrics are identified as defective by this rule.
  - If `store` is True and criteria/function are None, the rule stores a dict of {metric_name: value} for its metrics into label["rule_name"].
- False (default): this rule is used to label the defective machine directly.

#### `function`

The function used for this rule.

The supported functions are listed as follows:

- `variance`: checks whether the variance between the raw data and the baseline violates the criteria, where variance = (raw data - baseline) / baseline.

  For example, if the criteria is `lambda x:x>0.05`, the rule labels the machine as defective when the variance is larger than 5%.

- `value`: checks whether the raw data violates the criteria.

  For example, if the criteria is `lambda x:x>0`, the rule labels the machine as defective when the raw data is larger than 0.

- `multi_rules`: checks whether the combined results of multiple previous rules and metrics violate the criteria.

  Several examples:
  - `criteria: lambda label: bool(label["rule4"]+label["rule5"]>=2)` means this rule is triggered if the number of labeled metrics in rule4 and rule5 combined is at least 2.
  - `criteria: lambda label: bool(min(label["rule7"].values())<1)` means the machine is labeled defective if the minimum of the metric values stored by rule7 is smaller than 1.
    - Referencing a non-existent rule raises an exception.
    - If the test in the referenced rule failed or did not run, causing an exception in the criteria, no exception is raised, since this case is caught by the failure-check rule.

- `failure_check`: checks whether any metric in this rule failed or missed its test. The metrics in this rule should be like `{benchmark_name}/.*:return_code`, used to identify failures.

  - If an item never matches any metric in the raw data, the rule identifies it as a missed test.
  - If any metric violates the `value` criteria, meaning its return_code is not 0 (success), the rule identifies it as a failed test.

`Tips`: you must include a default rule for `${benchmark_name}/return_code`, as in the example above, which is used to identify failed tests.
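To make the `variance` and `multi_rules` semantics concrete, here is a minimal sketch with hypothetical throughput values and baselines; this mirrors the rule4/rule5/rule6 pattern from the example rule file but is not SuperBench's implementation:

```python
# Minimal sketch of `variance` plus `multi_rules`; all values hypothetical.

def variance(raw, baseline):
    # variance = (raw data - baseline) / baseline
    return (raw - baseline) / baseline

criteria = lambda x: x < -0.05  # more than 5% throughput downgrade

# rule4/rule5 style (store: True): count metrics meeting the criteria.
label = {
    "rule4": sum(criteria(variance(raw, base))
                 for raw, base in [(280.0, 300.0), (305.0, 300.0)]),
    "rule5": sum(criteria(variance(raw, base))
                 for raw, base in [(90.0, 100.0)]),
}

# rule6 style (multi_rules): combine the stored labels.
defective = bool(label["rule4"] + label["rule5"] >= 2)
print(label, defective)
```

Here one resnet-style metric dropped ~6.7% and the vgg-style metric dropped 10%, so the combined count reaches 2 and the machine is labeled defective.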

## Output

We support different output formats for the filtered defective machines, including json, jsonl, excel, md, and html.

By default, the output includes all defective machines' information: index, failure category, failure details, and the detailed metrics.

- index: the names of the defective machines.

- Category (diagnosis/category in json format): the categories defined in the violated rules.

- Defective Details (diagnosis/issue_details in json format): all violated metrics, including the metric data and the related rule.

- ${metric}: the data of the metrics defined in the rule file. If the rule function is `variance`, the data is shown as a variance percentage; if the rule function is `value`, the data is the raw value.

  - `'N/A'` indicates an empty value for the metric in the output files.


If you specify '--output-all' in the command, the output includes all machines' information plus an extra field indicating whether each machine is defective.

- Accept (diagnosis/accept in json format): False if the machine is defective, otherwise True.