---
id: monitor
---

# Monitor

SuperBench provides a `Monitor` module that collects system metrics and detects failures during benchmarking. The monitor currently supports only the CUDA platform. Users can enable it in the configuration file.

## Configuration

```yaml
superbench:
  monitor:
    enable: bool
    sample_duration: int
    sample_interval: int
```

### `enable`

Whether to enable the monitor module.

### `sample_duration`

The length, in seconds, of the window over which metrics such as CPU usage and NIC bandwidth are averaged.

### `sample_interval`

The interval, in seconds, between consecutive samples.
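
As an illustration of a windowed average such as NIC bandwidth, here is a minimal Python sketch. It assumes the value is derived from two readings of a cumulative byte counter taken `sample_duration` seconds apart; the function name `average_bandwidth` and the counter values are hypothetical and are not taken from SuperBench's source.

```python
def average_bandwidth(bytes_start, bytes_end, sample_duration):
    """Return the average bandwidth in bytes/s over one sampling window.

    Assumption: bytes_start and bytes_end are readings of a cumulative
    byte counter taken sample_duration seconds apart.
    """
    if sample_duration <= 0:
        raise ValueError('sample_duration must be positive')
    return (bytes_end - bytes_start) / sample_duration


# Hypothetical counter readings taken 2 seconds apart.
rx_bw = average_bandwidth(1_000_000, 3_000_000, sample_duration=2)
print(rx_bw)
```

Under this assumption, a longer `sample_duration` smooths out short bursts, while `sample_interval` only controls how often such a window is measured.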

## Metrics

The monitor module writes its data in JSON Lines format: each line is a JSON object containing the following metrics:

| Name                              | Unit       | Description                                                 |
|-----------------------------------|------------|-------------------------------------------------------------|
| time                              | datetime   | The time at which the system metrics were collected.        |
| cpu_usage                         | percentage | The average CPU utilization.                                |
| gpu_usage                         | percentage | The GPU utilization.                                        |
| gpu_temperature                   | celsius    | The GPU temperature.                                        |
| gpu_power_limit                   | watt       | The GPU power limit.                                        |
| gpu_mem_used                      | MB         | The used GPU memory.                                        |
| gpu_mem_total                     | MB         | The total GPU memory.                                       |
| gpu_corrected_ecc                 | count      | Number of corrected (single bit) ECC errors.                |
| gpu_uncorrected_ecc               | count      | Number of uncorrected (double bit) ECC errors.              |
| gpu_remap_correctable_error       | count      | Number of rows remapped due to correctable errors.          |
| gpu_remap_uncorrectable_error     | count      | Number of rows remapped due to uncorrectable errors.        |
| gpu_remap_max                     | count      | Number of banks with 8 available remapping resources.       |
| gpu_remap_high                    | count      | Number of banks with 7 available remapping resources.       |
| gpu_remap_partial                 | count      | Number of banks with 2 to 6 available remapping resources.  |
| gpu_remap_low                     | count      | Number of banks with 1 available remapping resource.        |
| gpu_remap_none                    | count      | Number of banks with 0 available remapping resources.       |
| {device}_receive_bw               | bytes/s    | Network receive bandwidth.                                  |
| {device}_transmit_bw              | bytes/s    | Network transmit bandwidth.                                 |
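
Because each line of the output is a standalone JSON object, it can be processed one line at a time. The following is a minimal Python sketch of that idea. The sample records, the timestamp format, the flat key layout, and the helper name `parse_records` are assumptions for illustration; the actual output may differ, for example by reporting values per GPU.

```python
import json

# Hypothetical monitor output: one JSON object per line (values are made up).
SAMPLE_OUTPUT = """\
{"time": "2021-01-01 00:00:00", "cpu_usage": 12.5, "gpu_usage": 90, "gpu_temperature": 65}
{"time": "2021-01-01 00:00:10", "cpu_usage": 17.5, "gpu_usage": 98, "gpu_temperature": 71}
"""


def parse_records(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_records(SAMPLE_OUTPUT)

# Simple summaries over the parsed samples.
max_gpu_temperature = max(r["gpu_temperature"] for r in records)
mean_cpu_usage = sum(r["cpu_usage"] for r in records) / len(records)
print(max_gpu_temperature, mean_cpu_usage)
```

To process a real output file, the same function could be applied to the file's contents once its path and exact schema are known.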