Commit 2e10fb0d authored by guoshzhao, committed by GitHub

Docs - Update docs for monitor. (#265)

**Description**
Update docs for monitor.
parent cb8a3cfb
......@@ -66,7 +66,8 @@ Here're the details about work directory structure for SuperBench Runner.
│ ├── benchmarks # benchmarks directory
│ │ └── benchmark-0 # output for each benchmark
│ │ └── rank-0 # output for each rank in each benchmark
│ │ └── results.json # raw results
│ │ ├── results.json # raw results
│ │ └── monitor.jsonl # monitor results (optional)
│ └── sb-exec.log # collected SuperBench Executor log
├── sb-run.log # SuperBench Runner log
├── sb.config.yaml # SuperBench configuration snapshot
......@@ -78,7 +79,7 @@ Here're the details about work directory structure for SuperBench Runner.
### SuperBench Executor
SuperBench Executor is the component that runs benchmarks inside a Docker container.
It will execute each benchmark and handle all pre- and post-processing, including health check, result validation, result processing, etc.
It will start the monitor (optional), execute each benchmark and handle all pre- and post-processing, including health check, result validation, result processing, etc.
Here is the SuperBench Executor's work directory structure inside the Docker container.
The `/root` directory is mounted from `$HOME/sb-workspace` on the host path.
......@@ -94,7 +95,8 @@ The `/root` directory is mounted from `$HOME/sb-workspace` on the host path.
├── benchmarks # benchmarks directory
│ └── benchmark-0 # output for each benchmark
│ └── rank-0 # output for each rank in each benchmark
│ └── results.json # raw results
│ ├── results.json # raw results
│ └── monitor.jsonl # monitor results (optional)
├── sb.config.yaml # SuperBench configuration snapshot
└── sb.env # SuperBench runtime environment variables
```
......
---
id: monitor
---
# Monitor
SuperBench provides a `Monitor` module that collects system metrics and detects failures during benchmarking. The monitor currently supports the CUDA platform only. Users can enable it in the configuration file.
## Configuration
```yaml
superbench:
monitor:
enable: bool
sample_duration: int
sample_interval: int
```
### `enable`
Whether to enable the monitor module.
### `sample_duration`
The length of the window, in seconds, over which averaged metrics such as CPU usage and NIC bandwidth are calculated.
### `sample_interval`
The interval, in seconds, between two consecutive samples.
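To illustrate how the two timing options can interact, here is a minimal Python sketch of a sampling loop for a bandwidth-style metric. It is not the SuperBench implementation: the function `sample_bandwidth`, its `read_bytes` and `sleep` parameters, and the choice to idle for the remainder of each interval are all assumptions made for this example.

```python
import time
from typing import Callable, Iterator


def sample_bandwidth(
    read_bytes: Callable[[], int],
    sample_duration: float,
    sample_interval: float,
    num_samples: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[float]:
    """Yield the average bandwidth in bytes/s, once per sampling interval.

    Hypothetical helper: `read_bytes` returns a monotonically increasing
    byte counter, and `sleep` is injectable so the loop can be tested.
    """
    for _ in range(num_samples):
        start = read_bytes()
        sleep(sample_duration)  # measurement window
        end = read_bytes()
        yield (end - start) / sample_duration
        # Assumed behavior: idle for whatever is left of the interval.
        sleep(max(0.0, sample_interval - sample_duration))


# Deterministic demo with a fake counter and a no-op sleep.
counter = iter([0, 1000, 1000, 3000])
rates = list(
    sample_bandwidth(lambda: next(counter), 2, 10, 2, sleep=lambda s: None)
)
print(rates)
```

In this sketch each sample averages the counter delta over `sample_duration` seconds, and a new sample starts roughly every `sample_interval` seconds.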
## Metrics
The monitor module writes its output in JSON Lines format. Each line is a JSON object that includes the following metrics:
| Name | Unit | Description |
|-----------------------------------|------------|-------------------------------------------------------------|
| time | datetime | The timestamp at which the system metrics were collected. |
| cpu_usage | percentage | The average CPU utilization. |
| gpu_usage | percentage | The GPU utilization. |
| gpu_temperature | celsius | The GPU temperature. |
| gpu_power_limit | watt | The GPU power limit. |
| gpu_mem_used | MB | The used GPU memory. |
| gpu_mem_total | MB | The total GPU memory. |
| gpu_corrected_ecc | count | Number of corrected (single bit) ECC errors. |
| gpu_uncorrected_ecc | count | Number of uncorrected (double bit) ECC errors. |
| gpu_remap_correctable_error | count | Number of rows remapped due to correctable errors. |
| gpu_remap_uncorrectable_error | count | Number of rows remapped due to uncorrectable errors. |
| gpu_remap_max | count | Number of banks with 8 available remapping resources. |
| gpu_remap_high | count | Number of banks with 7 available remapping resources. |
| gpu_remap_partial | count | Number of banks with 2 to 6 available remapping resources. |
| gpu_remap_low | count | Number of banks with 1 available remapping resource. |
| gpu_remap_none | count | Number of banks with 0 available remapping resources. |
| {device}_receive_bw | bytes/s | Network receive bandwidth. |
| {device}_transmit_bw | bytes/s | Network transmit bandwidth. |
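
Because the output is in JSON Lines format, it can be read with the Python standard library alone. The sketch below is illustrative only: the helper names `parse_monitor_lines` and `peak` are hypothetical, the sample records are made up, and the exact key layout of a real `monitor.jsonl` (for example, how per-GPU values are keyed) is not specified here, so adapt the metric names to what your file contains.

```python
import json
from typing import Iterable, Iterator, Optional


def parse_monitor_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def peak(records: Iterable[dict], name: str) -> Optional[float]:
    """Return the largest numeric value recorded for `name`, or None."""
    values = [r[name] for r in records if isinstance(r.get(name), (int, float))]
    return max(values) if values else None


# Made-up sample lines for illustration; real data comes from monitor.jsonl.
sample = [
    '{"time": "2021-12-01 10:00:00", "cpu_usage": 12.5, "gpu_temperature": 41}',
    '{"time": "2021-12-01 10:00:10", "cpu_usage": 30.0, "gpu_temperature": 47}',
]
records = list(parse_monitor_lines(sample))
print(peak(records, "gpu_temperature"))
```

To process a real file, pass an open file handle instead of the sample list, for example `with open(path) as f: records = list(parse_monitor_lines(f))`, where `path` points to a `monitor.jsonl` file in a rank directory.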
......@@ -32,6 +32,7 @@ module.exports = {
},
'user-tutorial/system-config',
'user-tutorial/data-diagnosis',
'user-tutorial/monitor',
'user-tutorial/container-images',
],
},
......