Commit ed027e4c authored by Yuting Jiang, committed by GitHub

Tools - Add runner for sys info and update docs (#532)

**Description**
Add runner for sys info to automatically collect on multiple nodes and
update related docs.

**Major Revision**
- add a runner for sys info, which checks the Docker status, runs
`sb node info` in the Docker container on every node, and fetches the results from all nodes

**Minor Revision**
- update cli and system-info doc
- update `sb node info` to save its output into output-dir/sys-info.json
parent 3a6622f7
...@@ -165,6 +165,26 @@ Execute GPT2 model benchmark in default configuration: ...@@ -165,6 +165,26 @@ Execute GPT2 model benchmark in default configuration:
sb exec --config-override superbench.enable="['gpt2_models']" sb exec --config-override superbench.enable="['gpt2_models']"
``` ```
### `sb node info`
Get system info on the local node.
```bash title="SB CLI"
sb node info [--output-dir]
```
#### Optional arguments
| Name | Default | Description |
|----------------|---------|-----------------------------------------------------------------------------|
| `--output-dir` | `None` | Path to output directory, outputs/{datetime} will be used if not specified. |
#### Examples
Get system info on the local node and save it into the `outputs` dir:
```bash title="SB CLI"
sb node info --output-dir outputs
```
### `sb result diagnosis` ### `sb result diagnosis`
Filter the defective machines automatically from benchmarking results according to rules defined in rule file. Filter the defective machines automatically from benchmarking results according to rules defined in rule file.
...@@ -284,6 +304,7 @@ sb run [--config-file] ...@@ -284,6 +304,7 @@ sb run [--config-file]
[--docker-image] [--docker-image]
[--docker-password] [--docker-password]
[--docker-username] [--docker-username]
[--get-info]
[--host-file] [--host-file]
[--host-list] [--host-list]
[--host-password] [--host-password]
...@@ -302,6 +323,7 @@ sb run [--config-file] ...@@ -302,6 +323,7 @@ sb run [--config-file]
| `--docker-image` `-i` | `superbench/superbench` | Docker image URI. | | `--docker-image` `-i` | `superbench/superbench` | Docker image URI. |
| `--docker-password` | `None` | Docker registry password if authentication is needed. | | `--docker-password` | `None` | Docker registry password if authentication is needed. |
| `--docker-username` | `None` | Docker registry username if authentication is needed. | | `--docker-username` | `None` | Docker registry username if authentication is needed. |
| `--get-info` `-g` | `False` | Collect system info. |
| `--host-file` `-f` | `None` | Path to Ansible inventory host file. | | `--host-file` `-f` | `None` | Path to Ansible inventory host file. |
| `--host-list` `-l` | `None` | Comma separated host list. | | `--host-list` `-l` | `None` | Comma separated host list. |
| `--host-password` | `None` | Host password or key passphrase if needed. | | `--host-password` | `None` | Host password or key passphrase if needed. |
...@@ -335,6 +357,16 @@ sb run --no-docker --host-list localhost --config-override \ ...@@ -335,6 +357,16 @@ sb run --no-docker --host-list localhost --config-override \
superbench.enable=kernel-launch superbench.env.SB_MICRO_PATH=/path/to/superbenchmark superbench.enable=kernel-launch superbench.env.SB_MICRO_PATH=/path/to/superbenchmark
``` ```
Collect system info on all nodes listed in `./host.ini` without running benchmarks:
```bash title="SB CLI"
sb run --get-info --host-file ./host.ini -C superbench.enable=none
```
Collect system info on all nodes listed in `./host.ini` in addition to running benchmarks:
```bash title="SB CLI"
sb run --get-info --host-file ./host.ini
```
### `sb version` ### `sb version`
Print the current SuperBench CLI version. Print the current SuperBench CLI version.
......
...@@ -4,6 +4,8 @@ id: system-config ...@@ -4,6 +4,8 @@ id: system-config
# System Config Info # System Config Info
This tool automatically collects system information on the tested GPU nodes, covering the following hardware categories:
- [System](#system) - [System](#system)
- [Memory](#memory) - [Memory](#memory)
- [CPU](#cpu) - [CPU](#cpu)
...@@ -12,7 +14,33 @@ id: system-config ...@@ -12,7 +14,33 @@ id: system-config
- [Accelerator](#accelerator) - [Accelerator](#accelerator)
- [PCIe](#pcie) - [PCIe](#pcie)
## Parameter amd Details ## Usage
### Usage on local machine
1. [Install SuperBench](../getting-started/installation.mdx) on the local machine with root privileges.
2. Collect the system info by running `sb node info --output-dir ${output_dir}` with root privileges.
3. After the command finishes, the system info of the local node is saved as the JSON file `sys_info.json` under \${output_dir}.
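The saved file can be inspected with a few lines of Python. This is a minimal sketch and not part of SuperBench: `load_sys_info` and `list_categories` are hypothetical helpers, and the default file name `sys_info.json` is the one the node handler in this commit writes, so adjust it if your version uses a different name. The sketch only assumes the file holds a JSON object.

```python
import json
from pathlib import Path


def load_sys_info(output_dir, file_name='sys_info.json'):
    """Load the JSON file written by `sb node info` (hypothetical helper).

    file_name defaults to the name used by the node handler in this commit.
    """
    path = Path(output_dir) / file_name
    with open(path, 'r') as f:
        return json.load(f)


def list_categories(info):
    """Return the sorted top-level keys of the loaded info dict."""
    return sorted(info.keys())
```

For example, `list_categories(load_sys_info('outputs'))` lists whatever top-level keys the tool produced on that node.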
### Usage on multiple remote machines
1. [Install SuperBench](../getting-started/installation.mdx) on the local machine.
2. [Deploy SuperBench](../getting-started/run-superbench.md#deploy) onto the remote machines.
3. Prepare the host file of the tested GPU nodes using [Ansible Inventory](../getting-started/configuration.md#ansible-inventory) on the local machine.
4. Once SuperBench is installed and the host file is ready, collect the system info automatically with the `sb run --get-info` command. The full command reference is in [SuperBench CLI](../cli.md).
```
sb run --get-info -f host.ini --output-dir ${output_dir} -C superbench.enable=none
```
5. After the command finishes, the system info of each node is saved as the JSON file `sys_info.json` under \${output_dir}/nodes/\${node_name}.
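After a multi-node run, the fetched per-node files can be gathered into one dictionary to spot nodes that returned nothing. The sketch below is illustrative only: `collect_node_infos` and `missing_nodes` are hypothetical helpers, the `nodes/<node_name>/` layout follows the last step above, and the default file name `sys_info.json` is taken from the node handler in this commit.

```python
import json
from pathlib import Path


def collect_node_infos(output_dir, file_name='sys_info.json'):
    """Map node name -> parsed system info for every node directory that has the file."""
    nodes_dir = Path(output_dir) / 'nodes'
    results = {}
    if not nodes_dir.is_dir():
        # Nothing was fetched, or the output directory does not exist.
        return results
    for node_dir in sorted(p for p in nodes_dir.iterdir() if p.is_dir()):
        info_file = node_dir / file_name
        if info_file.is_file():
            with open(info_file, 'r') as f:
                results[node_dir.name] = json.load(f)
    return results


def missing_nodes(output_dir, expected_nodes, file_name='sys_info.json'):
    """Return the expected node names for which no system info file was found."""
    found = collect_node_infos(output_dir, file_name)
    return [name for name in expected_nodes if name not in found]
```

Passing the host names from the inventory as `expected_nodes` gives a quick check that every node reported its info.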
## Parameter and Details
### System ### System
......
...@@ -67,6 +67,9 @@ def load_arguments(self, command): ...@@ -67,6 +67,9 @@ def load_arguments(self, command):
nargs='+', nargs='+',
help='Extra arguments to override config_file.' help='Extra arguments to override config_file.'
) )
ac.argument(
'get_info', options_list=('--get-info', '-g'), action='store_true', help='Collect node system info.'
)
with ArgumentsContext(self, 'benchmark') as ac: with ArgumentsContext(self, 'benchmark') as ac:
ac.argument('name', options_list=('--name', '-n'), type=str, help='Benchmark name or regular expression.') ac.argument('name', options_list=('--name', '-n'), type=str, help='Benchmark name or regular expression.')
......
...@@ -275,7 +275,8 @@ def run_command_handler( ...@@ -275,7 +275,8 @@ def run_command_handler(
output_dir=None, output_dir=None,
private_key=None, private_key=None,
config_file=None, config_file=None,
config_override=None config_override=None,
get_info=False,
): ):
"""Run the SuperBench benchmarks distributedly. """Run the SuperBench benchmarks distributedly.
...@@ -295,6 +296,7 @@ def run_command_handler( ...@@ -295,6 +296,7 @@ def run_command_handler(
config_file (str, optional): Path to SuperBench config file. Defaults to None. config_file (str, optional): Path to SuperBench config file. Defaults to None.
config_override (str, optional): Extra arguments to override config_file, config_override (str, optional): Extra arguments to override config_file,
following [Hydra syntax](https://hydra.cc/docs/advanced/override_grammar/basic). Defaults to None. following [Hydra syntax](https://hydra.cc/docs/advanced/override_grammar/basic). Defaults to None.
get_info (bool, optional): Collect node system info. Defaults to False.
Raises: Raises:
CLIError: If input arguments are invalid. CLIError: If input arguments are invalid.
...@@ -316,6 +318,10 @@ def run_command_handler( ...@@ -316,6 +318,10 @@ def run_command_handler(
) )
runner = SuperBenchRunner(sb_config, docker_config, ansible_config, sb_output_dir) runner = SuperBenchRunner(sb_config, docker_config, ansible_config, sb_output_dir)
runner.run() runner.run()
if get_info:
runner.run_sys_info()
if runner.get_failure_count() != 0: if runner.get_failure_count() != 0:
sys.exit(runner.get_failure_count()) sys.exit(runner.get_failure_count())
...@@ -63,6 +63,10 @@ ...@@ -63,6 +63,10 @@
text: > text: >
{cli_name} run --no-docker --host-list localhost {cli_name} run --no-docker --host-list localhost
--config-override superbench.enable=kernel-launch superbench.env.SB_MICRO_PATH=/path/to/superbenchmark --config-override superbench.enable=kernel-launch superbench.env.SB_MICRO_PATH=/path/to/superbenchmark
- name: Collect system info on all nodes in ./host.ini without running benchmarks
text: {cli_name} run --get-info --host-file ./host.ini -C superbench.enable=none
- name: Collect system info on all nodes in ./host.ini in addition to running benchmarks
text: {cli_name} run --get-info --host-file ./host.ini
""".format(cli_name=CLI_NAME) """.format(cli_name=CLI_NAME)
helps['benchmark'] = """ helps['benchmark'] = """
......
...@@ -3,17 +3,28 @@ ...@@ -3,17 +3,28 @@
"""SuperBench CLI node subgroup command handler.""" """SuperBench CLI node subgroup command handler."""
from pathlib import Path
import json
from superbench.tools import SystemInfo from superbench.tools import SystemInfo
from superbench.common.utils import create_sb_output_dir
def info_command_handler(): def info_command_handler(output_dir=None):
"""Get node hardware info. """Get node hardware info.
Args:
output_dir (str, optional): Output directory. Defaults to None.
Returns: Returns:
dict: node info. dict: node info.
""" """
try: try:
info = SystemInfo().get_all() info = SystemInfo().get_all()
output_dir = create_sb_output_dir(output_dir)
output_dir_path = Path(output_dir)
with open(output_dir_path / 'sys_info.json', 'w') as f:
json.dump(info, f)
except Exception as ex: except Exception as ex:
raise RuntimeError('Failed to get node info.') from ex raise RuntimeError('Failed to get node info.') from ex
return info return info
...@@ -199,6 +199,24 @@ def deploy(self): # pragma: no cover ...@@ -199,6 +199,24 @@ def deploy(self): # pragma: no cover
) )
self._ansible_client.run(self._ansible_client.get_playbook_config('deploy.yaml', extravars=extravars)) self._ansible_client.run(self._ansible_client.get_playbook_config('deploy.yaml', extravars=extravars))
def run_sys_info(self):
"""Run the system info on all nodes."""
self.check_env()
logger.info('Runner is going to get node system info.')
fcmd = "docker exec sb-workspace bash -c '{command}'"
if self._docker_config.skip:
fcmd = "bash -c 'cd $SB_WORKSPACE && {command}'"
ansible_runner_config = self._ansible_client.get_shell_config(
fcmd.format(command='sb node info --output-dir {output_dir}'.format(output_dir=self._sb_output_dir))
)
ansible_rc = self._ansible_client.run(ansible_runner_config, sudo=(not self._docker_config.skip))
if ansible_rc != 0:
self.cleanup()
self.fetch_results()
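The command composition in `run_sys_info` can be isolated into a pure function for review or unit testing. This is a sketch under assumptions, not code from the commit: `build_sys_info_command` is a hypothetical name, and `skip_docker` stands in for `self._docker_config.skip`. The two templates are copied from the method above.

```python
def build_sys_info_command(output_dir, skip_docker=False):
    """Compose the shell command run on each node (hypothetical helper)."""
    command = 'sb node info --output-dir {output_dir}'.format(output_dir=output_dir)
    if skip_docker:
        # Without Docker, run on the host from the SuperBench workspace.
        template = "bash -c 'cd $SB_WORKSPACE && {command}'"
    else:
        # With Docker, run inside the sb-workspace container.
        template = "docker exec sb-workspace bash -c '{command}'"
    return template.format(command=command)
```

Because the command is wrapped in single quotes, an output directory containing a single quote would break the quoting. The sketch does not handle that case.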
def check_env(self): # pragma: no cover def check_env(self): # pragma: no cover
"""Check SuperBench environment.""" """Check SuperBench environment."""
logger.info('Checking SuperBench environment.') logger.info('Checking SuperBench environment.')
......