Commits · 37b15db92c5b8ed9e10ee88ac8a555e52cb683f0 · tsoc / superbenchmark

20 Aug, 2021 1 commit

Runner: Add Feature - Generate summarized output files. (#157) · 7595d794

guoshzhao authored Aug 20, 2021

**Description**
Generate summarized output files from all nodes. For each metric, apply the reduce operation specified by its `reduce_op`.
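To illustrate the reduction step, here is a minimal sketch, assuming each metric has a `reduce_op` name and a list of per-rank values. The operation names (`min`, `max`, `sum`, `mean`) and the `reduce_metric` helper are hypothetical; the commit message does not list which operations are supported.

```python
import statistics

# Hypothetical mapping from a reduce_op name to a reducer. The real set of
# supported operations is not given in the commit message.
REDUCERS = {
    'min': min,
    'max': max,
    'sum': sum,
    'mean': statistics.mean,
}


def reduce_metric(values, reduce_op=None):
    """Reduce per-rank values with reduce_op; keep them as a list if it is None."""
    if reduce_op is None:
        return list(values)
    if reduce_op not in REDUCERS:
        raise ValueError(f'unsupported reduce_op: {reduce_op}')
    return REDUCERS[reduce_op](values)
```

Under this assumption, a metric without a `reduce_op` keeps one value per rank, which matches the `:rank` suffix seen in the summarized keys.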

**Major Revision**
- Generate the summarized JSON file per node, using the following key formats:
  - For a microbenchmark, the key format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
  - For a modelbenchmark, the key format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
  - `[]` means optional.
```
{
  "kernel-launch/overhead_event:0": 0.00583,
  "kernel-launch/overhead_event:1": 0.00545,
  "kernel-launch/overhead_event:2": 0.00581,
  "kernel-launch/overhead_event:3": 0.00572,
  "kernel-launch/overhead_event:4": 0.00559,
  "kernel-launch/overhead_event:5": 0.00591,
  "kernel-launch/overhead_event:6": 0.00562,
  "kernel-launch/overhead_event:7": 0.00586,
  "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
  "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
  "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
  "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
  "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
  "pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized JSONL file for all nodes; each line is the result from one node in JSON format.
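The key formats and the JSONL layout described above can be sketched as follows. This is an illustration only: `metric_key` and `to_jsonl` are hypothetical helpers, not functions from the repository.

```python
import json


def metric_key(benchmark, metric, sub_benchmark=None, run_count=None, rank=None):
    """Build a summary key; optional parts are skipped when they are None."""
    parts = [benchmark]
    if sub_benchmark is not None:
        parts.append(sub_benchmark)
    if run_count is not None:
        parts.append(str(run_count))
    parts.append(metric)
    key = '/'.join(parts)
    if rank is not None:
        key = f'{key}:{rank}'
    return key


def to_jsonl(node_summaries):
    """Serialize one summary dict per node, one JSON object per line."""
    return '\n'.join(json.dumps(summary) for summary in node_summaries)
```

For example, `metric_key('kernel-launch', 'overhead_event', rank=0)` yields a key of the same shape as the first entry in the sample JSON.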


19 Aug, 2021 1 commit

Runner - Support mpi mode (#146) · 98b6c0e3

Yifan Xiong authored Aug 19, 2021

Support MPI mode in runner:
* Concatenate the `mpirun` command.
* Support MCA and env config.
* Prepare the hostfile and update the Ansible host pattern.

Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
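
As a rough sketch of what concatenating an `mpirun` command from MCA and env settings can look like, assuming Open MPI's `-np`, `-hostfile`, `-mca`, and `-x` options. The `build_mpirun_command` helper and its parameters are hypothetical and are not taken from the runner's code.

```python
import shlex


def build_mpirun_command(command, proc_num, hostfile, mca=None, env=None):
    """Concatenate an Open MPI style mpirun command line (hypothetical helper)."""
    args = ['mpirun', '-np', str(proc_num), '-hostfile', hostfile]
    for key, value in (mca or {}).items():
        # -mca <key> <value> sets an MCA parameter.
        args += ['-mca', key, str(value)]
    for key, value in (env or {}).items():
        # -x NAME=value exports the variable to the launched processes.
        args += ['-x', f'{key}={value}']
    # Quote each mpirun argument; the wrapped command is appended as given.
    return ' '.join(shlex.quote(arg) for arg in args) + ' ' + command
```

The wrapped command is appended unquoted so that it can itself contain arguments; callers are expected to pass an already well-formed command string.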


08 Jul, 2021 1 commit

Runner & Executor - Support AMD GPU (#119) · 7458f83a

Yifan Xiong authored Jul 09, 2021

Support both NVIDIA and AMD GPU and check GPU vendor during deployment and execution.

* Add GPU environment check in sb deploy.
* Check GPU vendor in executor.
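
One possible way to check the GPU vendor is sketched below. It assumes that the NVIDIA driver exposes `/dev/nvidiactl` and that the AMD ROCm stack exposes `/dev/kfd` and `/dev/dri`; the commit message does not say how the actual check is done, and `detect_gpu_vendor` is a hypothetical name.

```python
import os


def detect_gpu_vendor(path_exists=os.path.exists):
    """Return 'nvidia', 'amd', or None depending on which device nodes exist.

    The device paths are assumptions about typical driver installs,
    not the check used in the repository.
    """
    if path_exists('/dev/nvidiactl'):
        return 'nvidia'
    if path_exists('/dev/kfd') and path_exists('/dev/dri'):
        return 'amd'
    return None
```

Passing `path_exists` as a parameter keeps the function testable on machines without a GPU.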


02 Jul, 2021 1 commit

Runner - Fetch benchmarks results on all nodes (#116) · fb7d4a73

Yifan Xiong authored Jul 02, 2021

Fetch benchmark results from all nodes; results are synced with rsync after each benchmark.
The results directory structure on the control node is as follows:

```
outputs/
└── datetime
    ├── nodes
    │   └── node-0
    │       ├── benchmarks
    │       │   ├── benchmark-0
    │       │   │   ├── rank-0
    │       │   │   │   └── results.json
    │       └── sb-exec.log
    ├── sb-run.log
    └── sb.config.yaml
```
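
Given the layout above, the per-rank result files could be gathered with a sketch like the following. The `collect_results` helper is hypothetical, and it assumes every `results.json` sits exactly at `nodes/<node>/benchmarks/<benchmark>/<rank>/results.json` under the run directory.

```python
import json
from pathlib import Path


def collect_results(run_dir):
    """Map (node, benchmark, rank) to the parsed results.json under run_dir."""
    results = {}
    pattern = 'nodes/*/benchmarks/*/*/results.json'
    for path in sorted(Path(run_dir).glob(pattern)):
        # parts end with: <node>, 'benchmarks', <benchmark>, <rank>, 'results.json'
        node, benchmark, rank = path.parts[-5], path.parts[-3], path.parts[-2]
        results[(node, benchmark, rank)] = json.loads(path.read_text())
    return results
```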


01 Jul, 2021 1 commit

CLI - Support custom output directory (#110) · 7b0b0e9a

Yifan Xiong authored Jul 01, 2021

* Support custom output directory.
* Update document.

23 Jun, 2021 1 commit

Bug bash - Fix bugs in multi GPU benchmarks (#98) · c0c43b8f

Yifan Xiong authored Jun 23, 2021

* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.
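
To illustrate why a dict-shaped env and careful quoting matter when composing a bash command, here is a minimal sketch. The `build_bash_command` helper is hypothetical and is not the fix applied in this commit; it only shows one safe way to combine env variables and a command with `shlex.quote`.

```python
import shlex


def build_bash_command(command, env=None):
    """Wrap command in `bash -c`, prefixing env assignments with safe quoting."""
    # Each value is quoted on its own so spaces and quotes survive the shell.
    assignments = ' '.join(
        f'{key}={shlex.quote(str(value))}' for key, value in (env or {}).items()
    )
    script = f'{assignments} {command}'.strip()
    # The whole script is quoted again because it is a single argument to bash -c.
    return f'bash -c {shlex.quote(script)}'
```

With a dict, each variable name maps to exactly one value, so there is no need to parse `KEY=VALUE` strings before quoting.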


02 Jun, 2021 1 commit

Runner - Support local mode in runner (#88) · 6b0ca1cb

Yifan Xiong authored Jun 02, 2021

* Support local mode in runner.

28 May, 2021 1 commit

Runner - Support torch.distributed mode in runner (#81) · 8b4f613a

Yifan Xiong authored May 28, 2021

* Support `torch.distributed` mode in runner.
* Support given `proc_num` and `node_num` in `torch.distributed` mode.
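
A sketch of how `proc_num` and `node_num` might map onto a `torch.distributed.launch` command line is shown below. The mapping of `proc_num` to `--nproc_per_node` and `node_num` to `--nnodes` is an assumption, and `build_torch_distributed_command` together with its default address and port are hypothetical.

```python
def build_torch_distributed_command(script, proc_num, node_num, node_rank=0,
                                    master_addr='127.0.0.1', master_port=29500):
    """Build a torch.distributed.launch command line for one node.

    Assumes proc_num is the number of processes per node and node_num is
    the total number of nodes.
    """
    return ' '.join([
        'python3', '-m', 'torch.distributed.launch',
        f'--nproc_per_node={proc_num}',
        f'--nnodes={node_num}',
        f'--node_rank={node_rank}',
        f'--master_addr={master_addr}',
        f'--master_port={master_port}',
        script,
    ])
```

Each node would run the same command with its own `node_rank`, while `master_addr` and `master_port` stay identical across nodes.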


26 May, 2021 1 commit

CI/CD - Add integration tests for Ansible playbooks (#82) · e7f6d8ba

Yifan Xiong authored May 26, 2021

* Add integration tests for Ansible playbooks
* Add `gpu_vendor` var to bypass gpu mount

23 May, 2021 1 commit

Runner - Implement ansible client and runner (#69) · c05e173b

Yifan Xiong authored May 23, 2021

Implement ansible client and runner:
* add ansible client
* add deploy and check_env playbooks

12 Apr, 2021 1 commit

Runner: Init - Add superbench runner class (#38) · f73d1ade

Yifan Xiong authored Apr 12, 2021

* init runner class with not implemented