1. 10 Dec, 2021 1 commit
    • Monitor: Integration - Integrate monitor into Superbench (#259) · 6e357fb9
      guoshzhao authored
      **Description**
      Integrate monitor into Superbench.
      
      **Major Revision**
      - Initialize, start, and stop the monitor in the SB executor.
      - Parse the monitor data in the SB runner and merge it into the benchmark results.
      - Specify a ReduceType for monitor metrics, such as MAX, MIN, and LAST.
      - Add monitor configs into the config file.
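      The ReduceType behavior described above can be sketched as follows (a minimal illustration under assumed names, not SuperBench's actual monitor code):

      ```
      from enum import Enum

      class ReduceType(Enum):
          """How to collapse a sampled monitor time series into one value."""
          MAX = 'max'
          MIN = 'min'
          LAST = 'last'

      def reduce_metric(samples, reduce_type):
          """Reduce a list of monitor samples (e.g. GPU temperature over a run)."""
          if not samples:
              return None
          if reduce_type is ReduceType.MAX:
              return max(samples)
          if reduce_type is ReduceType.MIN:
              return min(samples)
          return samples[-1]  # ReduceType.LAST: keep the final sample

      print(reduce_metric([61, 67, 64], ReduceType.MAX))  # 67
      ```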
      6e357fb9
  2. 08 Dec, 2021 1 commit
    • Bug - Fix issues for distributed runs (#258) · 213ab14b
      Yifan Xiong authored
      Fix issues for distributed runs:
      * fix config for memory bandwidth benchmarks
      * add throttling for high concurrency docker pull
      * update rsync path and exclude directories
      * handle exceptions when creating summary
      * tune for logging
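      The docker-pull throttling above can be sketched with a semaphore that caps concurrent `docker pull` invocations (an illustrative sketch; `throttled_pull`, `pull_all`, and the slot count are assumptions, not SuperBench's actual code):

      ```
      import subprocess
      import threading
      from concurrent.futures import ThreadPoolExecutor

      MAX_CONCURRENT_PULLS = 4  # assumed cap; tune for the registry's limits
      _pull_slots = threading.Semaphore(MAX_CONCURRENT_PULLS)

      def throttled_pull(image, pull=None):
          """Pull one image after acquiring a slot; `pull` is injectable for testing."""
          if pull is None:
              pull = lambda img: subprocess.run(['docker', 'pull', img]).returncode
          with _pull_slots:
              return pull(image)

      def pull_all(images, pull=None):
          """Pull many images concurrently, never more than MAX_CONCURRENT_PULLS at once."""
          with ThreadPoolExecutor(max_workers=len(images)) as pool:
              return list(pool.map(lambda img: throttled_pull(img, pull), images))
      ```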
      213ab14b
  3. 26 Sep, 2021 1 commit
    • Release - SuperBench v0.3.0 (#212) · dfbd70b1
      Yifan Xiong authored
      **Description**
      Cherry-pick bug fixes from v0.3.0 to main.
      
      **Major Revisions**
      * Docs - Upgrade version and release note (#209)
      * Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
      * Benchmarks: Update - Update benchmarks in configuration file (#208)
      * CI/CD - Update GitHub Action VM (#211)
      * Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
      * CI/CD - Fix bug in build image for push event (#205)
      * Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
      * Tool: Fix bug - Fix function naming issue in system info  (#200)
      * CI/CD - Push images in GitHub Action (#202)
      * Bug - Fix torch.distributed command for single node (#201)
      * CLI - Integrate system info for node (#199)
      * Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
      * CI/CD - Add ROCm image build in GitHub Actions (#194)
      * Bug: Fix bug - fix bug of hipBusBandwidth build (#193)
      * Benchmarks: Build Pipeline - Restore rocblas build logic (#197)
      * Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
      * Bug - Revise 'docker run' in sb deploy (#195)
      * Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190)
      Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
      Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
      Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
      dfbd70b1
  4. 20 Aug, 2021 1 commit
    • Runner: Add Feature - Generate summarized output files. (#157) · 7595d794
      guoshzhao authored
      **Description**
      Generate the summarized output files from all nodes. For each metric, apply the reduce operation specified by `reduce_op`.
      
      **Major Revision**
      - Generate the summarized JSON file per node:
      For micro-benchmarks, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`.
      For model benchmarks, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`.
      `[]` denotes an optional part.
      ```
      {
        "kernel-launch/overhead_event:0": 0.00583,
        "kernel-launch/overhead_event:1": 0.00545,
        "kernel-launch/overhead_event:2": 0.00581,
        "kernel-launch/overhead_event:3": 0.00572,
        "kernel-launch/overhead_event:4": 0.00559,
        "kernel-launch/overhead_event:5": 0.00591,
        "kernel-launch/overhead_event:6": 0.00562,
        "kernel-launch/overhead_event:7": 0.00586,
        "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
        "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
        "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
        "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
        "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
        "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
        "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
        "pytorch-sharding-matmul/1/allgather": 10.088025093078613
      }
      ```
      - Generate the summarized JSONL file for all nodes; each line is the result from one node in JSON format.
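      The key formats above can be expressed as a small helper (a sketch with a hypothetical name, not part of SuperBench's API):

      ```
      def metric_key(benchmark, metric, run_count=None, rank=None, sub_benchmark=None):
          """Build a summarized metric key.

          micro-benchmark:  {benchmark_name}/[{run_count}/]{metric_name}[:rank]
          model benchmark:  {benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}
          """
          parts = [benchmark]
          if sub_benchmark is not None:
              parts.append(sub_benchmark)
          if run_count is not None:
              parts.append(str(run_count))
          parts.append(metric)
          key = '/'.join(parts)
          if rank is not None:
              key += ':{}'.format(rank)
          return key

      print(metric_key('kernel-launch', 'overhead_event', rank=0))  # kernel-launch/overhead_event:0
      ```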
      7595d794
  5. 19 Aug, 2021 1 commit
  6. 08 Jul, 2021 1 commit
  7. 02 Jul, 2021 1 commit
    • Runner - Fetch benchmarks results on all nodes (#116) · fb7d4a73
      Yifan Xiong authored
      Fetch benchmark results from all nodes; results are rsynced to the control node after each benchmark finishes.
      The results directory structure on the control node is as follows:
      
      ```
      outputs/
      └── datetime
          ├── nodes
          │   └── node-0
          │       ├── benchmarks
          │       │   ├── benchmark-0
          │       │   │   ├── rank-0
          │       │   │   │   └── results.json
          │       └── sb-exec.log
          ├── sb-run.log
          └── sb.config.yaml
      ```
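      Given that layout, gathering every rank's `results.json` on the control node might look like the following (an illustrative sketch, not the actual runner code):

      ```
      import json
      from pathlib import Path

      def collect_results(output_dir):
          """Walk outputs/<datetime>/ and gather results keyed by node, benchmark, rank."""
          root = Path(output_dir)
          results = {}
          for path in sorted(root.glob('nodes/*/benchmarks/*/rank-*/results.json')):
              # relative parts: nodes / <node> / benchmarks / <benchmark> / <rank> / results.json
              _, node, _, benchmark, rank, _ = path.relative_to(root).parts
              results.setdefault(node, {}).setdefault(benchmark, {})[rank] = json.loads(path.read_text())
          return results
      ```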
      fb7d4a73
  8. 01 Jul, 2021 1 commit
  9. 23 Jun, 2021 1 commit
    • Bug bash - Fix bugs in multi GPU benchmarks (#98) · c0c43b8f
      Yifan Xiong authored
      * Add `sb deploy` command content.
      * Fix inline if-expression syntax in playbook.
      * Fix quote escape issue in bash command.
      * Add custom env in config.
      * Update default config for multi GPU benchmarks.
      * Update MANIFEST.in to include jinja2 template.
      * Require jinja2 minimum version.
      * Fix occasional duplicate output in Ansible runner.
      * Fix mixed color from Ansible and Python colorlog.
      * Update according to comments.
      * Change superbench.env from list to dict in config file.
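      The last bullet changes the shape of `superbench.env`; a before/after sketch (the variable shown is illustrative, not from the actual config):

      ```
      # before: a list of KEY=VALUE strings
      superbench:
        env:
          - NCCL_DEBUG=INFO

      # after: a mapping of variable names to values
      superbench:
        env:
          NCCL_DEBUG: INFO
      ```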
      c0c43b8f
  10. 02 Jun, 2021 1 commit
  11. 28 May, 2021 1 commit
  12. 26 May, 2021 1 commit
  13. 23 May, 2021 1 commit
  14. 12 Apr, 2021 1 commit