Commit 2e10fb0d authored by guoshzhao, committed by GitHub

Docs - Update docs for monitor. (#265)

**Description**
Update docs for monitor.
parent cb8a3cfb
@@ -66,7 +66,8 @@ Here're the details about work directory structure for SuperBench Runner.
 │ ├── benchmarks # benchmarks directory
 │ │ └── benchmark-0 # output for each benchmark
 │ │ └── rank-0 # output for each rank in each benchmark
-│ │ └── results.json # raw results
+│ │ ├── results.json # raw results
+│ │ └── monitor.jsonl # monitor results (optional)
 │ └── sb-exec.log # collected SuperBench Executor log
 ├── sb-run.log # SuperBench Runner log
 ├── sb.config.yaml # SuperBench configuration snapshot
@@ -78,7 +79,7 @@ Here're the details about work directory structure for SuperBench Runner.
 ### SuperBench Executor
 SuperBench Executor is the component to run benchmarks inside Docker container.
-It will execute each benchmark and handle all pre- and post-processing, including health check, result validation, result processing, etc.
+It will start the monitor (optional), execute each benchmark and handle all pre- and post-processing, including health check, result validation, result processing, etc.
 Here're the SuperBench Executor's work directory structure inside Docker container.
 The `/root` directory is mounted from `$HOME/sb-workspace` on the host path.
@@ -94,7 +95,8 @@ The `/root` directory is mounted from `$HOME/sb-workspace` on the host path.
 ├── benchmarks # benchmarks directory
 │ └── benchmark-0 # output for each benchmark
 │ └── rank-0 # output for each rank in each benchmark
-│ └── results.json # raw results
+│ ├── results.json # raw results
+│ └── monitor.jsonl # monitor results (optional)
 ├── sb.config.yaml # SuperBench configuration snapshot
 └── sb.env # SuperBench runtime environment variables
``` ```
......
---
id: monitor
---
# Monitor
SuperBench provides a `Monitor` module to collect system metrics and detect failures during benchmarking. Currently, the monitor supports the CUDA platform only. Users can enable it in the config file.
## Configuration
```yaml
superbench:
monitor:
enable: bool
sample_duration: int
sample_interval: int
```
### `enable`
Whether to enable the monitor module.
### `sample_duration`
The length of the window, in seconds, over which averaged metrics such as CPU usage and NIC bandwidth are calculated.
### `sample_interval`
Take a sample every `sample_interval` seconds.
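To make the averaging concrete, here is a minimal Python sketch of how a windowed bandwidth figure could be derived from two readings of a cumulative byte counter taken `sample_duration` seconds apart. It is an illustration under assumptions, not SuperBench's actual implementation: the function name `average_bandwidth` and the counter values are hypothetical.

```python
def average_bandwidth(bytes_start: int, bytes_end: int, sample_duration: int) -> float:
    """Return the average bandwidth in bytes/s over one sampling window."""
    if sample_duration <= 0:
        raise ValueError("sample_duration must be a positive number of seconds")
    return (bytes_end - bytes_start) / sample_duration


# Hypothetical counter readings taken sample_duration = 2 seconds apart.
print(average_bandwidth(1_000_000, 5_000_000, 2))  # 2000000.0
```

A longer `sample_duration` smooths out short bursts, while a shorter one reacts faster to changes.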
## Metrics
The monitor module writes its data in JSON Lines format. Each line is a JSON object that includes the following metrics:
| Name | Unit | Description |
|-----------------------------------|------------|-------------------------------------------------------------|
| time | datetime | The timestamp to collect the system metrics. |
| cpu_usage | percentage | The average CPU utilization. |
| gpu_usage | percentage | The GPU utilization. |
| gpu_temperature | celsius | The GPU temperature. |
| gpu_power_limit | watt | The GPU power limit. |
| gpu_mem_used | MB | The used GPU memory. |
| gpu_mem_total | MB | The total GPU memory. |
| gpu_corrected_ecc | count | Number of corrected (single-bit) ECC errors. |
| gpu_uncorrected_ecc | count | Number of uncorrected (double-bit) ECC errors. |
| gpu_remap_correctable_error | count | Number of rows remapped due to correctable errors. |
| gpu_remap_uncorrectable_error | count | Number of rows remapped due to uncorrectable errors. |
| gpu_remap_max | count | Number of banks with 8 available remapping resources. |
| gpu_remap_high | count | Number of banks with 7 available remapping resources. |
| gpu_remap_partial | count | Number of banks with 2 to 6 available remapping resources. |
| gpu_remap_low | count | Number of banks with 1 available remapping resource. |
| gpu_remap_none | count | Number of banks with 0 available remapping resources. |
| {device}_receive_bw | bytes/s | Network receive bandwidth. |
| {device}_transmit_bw | bytes/s | Network transmit bandwidth. |
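
Because each line is an independent JSON object, the output can be processed with a few lines of Python. The following is a minimal sketch rather than an official tool: the helper names `load_monitor_records` and `mean_metric` are hypothetical, and the sample lines (including the timestamp format) are made up for illustration. Only the `time` and `cpu_usage` field names come from the table above.

```python
import json
from statistics import mean


def load_monitor_records(lines):
    """Parse JSON Lines text (an iterable of strings) into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]


def mean_metric(records, name):
    """Average a numeric metric over the records that contain it."""
    values = [r[name] for r in records if name in r]
    return mean(values) if values else None


# Hypothetical sample content; real files hold more fields per line.
sample = """\
{"time": "2021-12-01 10:00:00", "cpu_usage": 10.0}
{"time": "2021-12-01 10:00:10", "cpu_usage": 30.0}
"""

records = load_monitor_records(sample.splitlines())
print(mean_metric(records, "cpu_usage"))  # 20.0
```

To read a real file, pass an open file object instead of the sample text, for example `with open("monitor.jsonl") as f: records = load_monitor_records(f)`.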
@@ -32,6 +32,7 @@ module.exports = {
 },
 'user-tutorial/system-config',
 'user-tutorial/data-diagnosis',
+'user-tutorial/monitor',
 'user-tutorial/container-images',
 ],
 },