Commits · faeee0a7cc0636655dda479b977d00e9d88ef82c · tsoc / superbenchmark

13 Aug, 2022 1 commit

Auto generate ibstat file for topo aware traffic pattern (#381) · faeee0a7

Yang Wang authored Aug 13, 2022

An enhancement for topo-aware IB performance validation #373.
This PR auto-generates the required ibstat file `ib_traffic_topo_aware_ibstat.txt`, which is used as input to build a graph.

faeee0a7

26 Jul, 2022 1 commit

Support topo-aware IB performance validation (#373) · ef4d6574

Jie Zhang authored Jul 26, 2022



* Support topo-aware IB performance validation

Add a new pattern `topo-aware`, so the user can run IB performance
tests based on the VMs' topology information. This way, the user can
validate the IB performance across VM pairs at different distances
as a quick test instead of a pair-wise test.

To run with the topo-aware pattern, the user needs to specify three required
(and two optional) parameters in the YAML config file:
--pattern	topo-aware
--ibstat	path to ibstat output
--ibnetdiscover	path to ibnetdiscover output
--min_dist	minimum distance of VM pairs (optional, default 2)
--max_dist	maximum distance of VM pairs (optional, default 6)

The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs with
the specified distance (# hops).

The specified IB test then runs across these
generated VM pairs.
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
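
The pair-generation step described in this commit can be illustrated with a minimal sketch. This is not the project's `topo_aware` module: the graph is a hand-written adjacency dict rather than parsed ibstat/ibnetdiscover output, and the names `hop_distances` and `vm_pairs_by_distance` are hypothetical. Only the default distance bounds (2 and 6) come from the commit message.

```python
from collections import deque
from itertools import combinations


def hop_distances(graph, source):
    """Breadth-first search: number of hops from `source` to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def vm_pairs_by_distance(graph, vms, min_dist=2, max_dist=6):
    """Group VM pairs by hop count, keeping distances in [min_dist, max_dist]."""
    dist_from = {vm: hop_distances(graph, vm) for vm in vms}
    pairs = {}
    for a, b in combinations(sorted(vms), 2):
        d = dist_from[a].get(b)  # None if b is unreachable from a
        if d is not None and min_dist <= d <= max_dist:
            pairs.setdefault(d, []).append((a, b))
    return pairs


# Hypothetical topology: two leaf switches connected through one spine switch.
graph = {
    "vm0": ["leaf0"], "vm1": ["leaf0"],
    "vm2": ["leaf1"], "vm3": ["leaf1"],
    "leaf0": ["vm0", "vm1", "spine"],
    "leaf1": ["vm2", "vm3", "spine"],
    "spine": ["leaf0", "leaf1"],
}
pairs = vm_pairs_by_distance(graph, ["vm0", "vm1", "vm2", "vm3"])
```

In this toy topology, VMs under the same leaf switch are two hops apart and VMs under different leaf switches are four hops apart, so one IB test per distance group covers both cases without testing every pair.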

* Add description about topology aware ib traffic tests
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add unit test to verify generated topology aware config file

This commit adds a unit test to verify that the generated topology aware
config file is correct. To do so, four new data files are added in
order to invoke the gen_topo_aware_config function to generate a topology
aware config file, which is then compared with the expected config file.
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Fix lint issue on Azure pipeline
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

ef4d6574

13 Jul, 2022 1 commit

Add dependencies (#374) · 16b6385d

Yifan Xiong authored Jul 13, 2022

Add dependencies

* include ndv4-topo.xml in cuda docker images
* require requests version to avoid RequestsDependencyWarning

16b6385d

05 Jul, 2022 1 commit
- CLI - Support SKU auto detect if running on Azure VM (#365) · a94ead34
  Yifan Xiong authored Jul 05, 2022
```
Support SKU auto-detection and use the corresponding benchmark config when running on an Azure VM.
```
  a94ead34
14 Jun, 2022 1 commit

Support `sb run` on host directly without Docker (#358) · a4937e95

Yifan Xiong authored Jun 14, 2022

**Description**

Support `sb run` on host directly without Docker

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` is specified.
- Update docs and tests correspondingly.

a4937e95

29 Apr, 2022 1 commit

Release - SuperBench v0.5.0 (#350) · 6681c720

Yifan Xiong authored Apr 29, 2022



**Description**

Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

6681c720

15 Mar, 2022 1 commit

Analyzer - Add md and html output format for DataDiagnosis (#325) · b3c95f18

user4543 authored Mar 15, 2022

**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis

b3c95f18

19 Jan, 2022 1 commit
- Benchmarks: Add Feature - Add percentile metrics for ort and pytorch inference benchmarks (#283) · fd2bc9e0
  guoshzhao authored Jan 19, 2022
```
**Description**
Add 50th, 90th, 95th, 99th, and 99.9th percentile latency metrics for ORT and PyTorch inference benchmarks.
```
  fd2bc9e0
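
As an illustration of what such percentile metrics mean, here is a minimal sketch using the nearest-rank definition. The commit does not say which percentile method the benchmarks use, so the method, the `percentile` helper, the `latency_p*` metric names, and the sample latencies are all assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Multiply before dividing to limit floating-point error for values like 99.9.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [float(i) for i in range(1, 1001)]
metrics = {f"latency_p{p}": percentile(latencies_ms, p) for p in (50, 90, 95, 99, 99.9)}
```

Tail percentiles such as the 99th and 99.9th are reported alongside the median because a small number of slow requests can dominate user-visible latency while barely moving the average.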
30 Dec, 2021 1 commit

Release - SuperBench v0.4.0 (#278) · ff563b66

Yifan Xiong authored Dec 30, 2021



__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

ff563b66

10 Dec, 2021 2 commits

Benchmarks: Add Benchmark - Add ONNXRuntime inference benchmark based on ORT python API (#245) · 4d85630a

guoshzhao authored Dec 10, 2021

**Description**
Add ONNXRuntime inference benchmark based on ORT python API.

**Major Revision**
- Add `ORTInferenceBenchmark` class to export pytorch model to onnx model and do inference
- Add tests and example for `ort-inference` benchmark
- Update the introduction docs.

4d85630a

Analyzer: Add Feature - Add basic analysis features (#248) · c2f942cb

Yuting Jiang authored Dec 10, 2021

**Description**
Add basic analysis features.

**Major Revision**
- Add statistics and correlations of the raw data
- Add numeric outlier detection (inter_quartile_range)
- Add boxplot for selected metric
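
For readers unfamiliar with interquartile-range outlier detection, the idea can be sketched with the standard library. This is not the analyzer's implementation: the function name `iqr_outliers`, the conventional factor `k=1.5`, and the sample data are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up metric values from six nodes; one node is far slower than the rest.
outliers = iqr_outliers([1, 2, 3, 4, 5, 100])
```

The same bounds are what a boxplot draws as whiskers, which is why the two features fit together in one analysis pass.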

c2f942cb

08 Dec, 2021 1 commit

Analyzer: Initialization - Add baseline-based data diagnosis module (#242) · c13ed2a2

Yuting Jiang authored Dec 08, 2021

**Description**
Add data diagnosis module.

**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes
- Add RuleOp class to define rule operators

c13ed2a2

12 Oct, 2021 1 commit

Benchmarks: Add Benchmark - Add tcp connectivity validation microbenchmark (#217) · 49cc8f9a

Yuting Jiang authored Oct 13, 2021

**Description**
Add tcp connectivity validation microbenchmark, which validates TCP connectivity between the current node and several nodes in the hostfile.

**Major Revision**
- Add tcp connectivity validation microbenchmark and the related test and example

49cc8f9a

06 Sep, 2021 1 commit

Tools: Add Feature - Add script to generate system config info. (#160) · 37b15db9

Yuting Jiang authored Sep 06, 2021

**Description**
Add script to generate system config info.

**Major Revision**
- Add script to generate system config info into the dict in superbench/tools.

37b15db9

31 Aug, 2021 1 commit

Setup: Revision - Revise torch extra_require (#177) · c8357f4e

guoshzhao authored Aug 31, 2021

**Description**
Change the minimum version requirements for superbench:
```
'torch>=1.7.0a0',
'torchvision>=0.8.0a0',
```

c8357f4e

20 Aug, 2021 1 commit

Runner: Add Feature - Generate summarized output files. (#157) · 7595d794

guoshzhao authored Aug 20, 2021

**Description**
Generate the summarized output files from all nodes. For each metric, apply the reduce operation specified by `reduce_op`.

**Major Revision**
- Generate the summarized json file per node:
For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
`[]` means optional.
```
{
  "kernel-launch/overhead_event:0": 0.00583,
  "kernel-launch/overhead_event:1": 0.00545,
  "kernel-launch/overhead_event:2": 0.00581,
  "kernel-launch/overhead_event:3": 0.00572,
  "kernel-launch/overhead_event:4": 0.00559,
  "kernel-launch/overhead_event:5": 0.00591,
  "kernel-launch/overhead_event:6": 0.00562,
  "kernel-launch/overhead_event:7": 0.00586,
  "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
  "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
  "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
  "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
  "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
  "pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes; each line is the result from one node in JSON format.
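
The key formats and the jsonl layout described in this commit can be sketched as follows. The helper names `micro_metric_key`, `model_metric_key`, and `to_jsonl` are hypothetical and are not taken from the runner's code; only the format strings and the sample keys come from the commit message.

```python
import json


def micro_metric_key(benchmark, metric, run_count=None, rank=None):
    """Build `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`."""
    key = f"{benchmark}/"
    if run_count is not None:
        key += f"{run_count}/"
    key += metric
    if rank is not None:
        key += f":{rank}"
    return key


def model_metric_key(benchmark, sub_benchmark, metric, run_count=None):
    """Build `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`."""
    key = f"{benchmark}/{sub_benchmark}/"
    if run_count is not None:
        key += f"{run_count}/"
    return key + metric


def to_jsonl(node_summaries):
    """Serialize one JSON object per line, one line per node."""
    return "\n".join(json.dumps(summary, sort_keys=True) for summary in node_summaries)


# Two nodes with made-up values for the same metric.
key = micro_metric_key("kernel-launch", "overhead_event", rank=0)
jsonl_text = to_jsonl([{key: 0.00583}, {key: 0.00545}])
```

Keeping one node per line means the all-nodes file can be appended to and parsed line by line without loading every node's results at once.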

7595d794

23 Jun, 2021 1 commit

Bug bash - Fix bugs in multi GPU benchmarks (#98) · c0c43b8f

Yifan Xiong authored Jun 23, 2021

* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.

c0c43b8f

16 Jun, 2021 1 commit

Bug bash - Fix bugs and refine log in single GPU benchmarks (#97) · ddbc51a1

Yifan Xiong authored Jun 16, 2021

Fix bugs and refine log in single GPU benchmarks:

* Fix none framework issue
* Fix empty parameter bug
* Remove missed mobilenet_v3 models
* Change benchmark registration log to debug level
* Add pid in logging
* Add missing benchmarks in default config
* Fix deprecated logging warn

ddbc51a1

02 Jun, 2021 1 commit
- Runner - Support local mode in runner (#88) · 6b0ca1cb
  Yifan Xiong authored Jun 02, 2021
```
* Support local mode in runner.
```
  6b0ca1cb
01 Jun, 2021 1 commit
- Benchmarks: Add Feature - Add nvml package to provide python interfaces of nvidia. (#91) · 331c740a
  guoshzhao authored Jun 01, 2021
  
  331c740a
23 May, 2021 1 commit
- Runner - Implement ansible client and runner (#69) · c05e173b
  Yifan Xiong authored May 23, 2021
```
Implement ansible client and runner:
* add ansible client
* add deploy and check_env playbooks
```
  c05e173b
18 May, 2021 1 commit

CLI - Refine CLI handlers (#68) · 977b1a73

Yifan Xiong authored May 18, 2021

* use absolute path of input file
* parse registry uri from image
* merge common parts for arguments processing

977b1a73

12 Apr, 2021 2 commits
- CLI - Integration with Executor and Runner (#26) · 57114294
  Yifan Xiong authored Apr 12, 2021
```
* CLI integration with Executor and Runner
```
  57114294
- Add CUDA dockerfile for superbench (#43) · 67053d9a
  Yifan Xiong authored Apr 12, 2021
```
* add cuda11.1.1 dockerfile
```
  67053d9a
29 Mar, 2021 1 commit

Update logger (#28) · 0e2b2b08

Yifan Xiong authored Mar 29, 2021

Update logger class.
* add file handler along with stream handler
* add colored formatter
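
A logger with both handlers and a colored formatter might look like the sketch below. It uses only the standard `logging` module with ANSI escape codes. The names `ColorFormatter` and `build_logger`, the format string, and the color choices are assumptions, not the project's logger class.

```python
import logging
import sys


class ColorFormatter(logging.Formatter):
    """Wrap each formatted record in an ANSI color chosen by its level."""

    COLORS = {logging.WARNING: "\033[33m", logging.ERROR: "\033[31m"}
    RESET = "\033[0m"

    def format(self, record):
        text = super().format(record)
        color = self.COLORS.get(record.levelno)
        return f"{color}{text}{self.RESET}" if color else text


def build_logger(name, log_path):
    """Create a logger that writes colored output to stdout and plain text to a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    fmt = "[%(asctime)s %(levelname)s] %(message)s"

    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(ColorFormatter(fmt))
    file_handler = logging.FileHandler(log_path)
    file_handler.setFormatter(logging.Formatter(fmt))  # no color codes in the file

    logger.addHandler(stream_handler)
    logger.addHandler(file_handler)
    return logger
```

Using a plain formatter for the file handler keeps escape sequences out of the log file, so it stays readable in editors and greppable.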

0e2b2b08

26 Mar, 2021 1 commit
- CLI: Code Revision - Use omegaconf to replace hydra for configuration (#27) · 91b44bc5
  Yifan Xiong authored Mar 26, 2021
```
Use omegaconf to replace hydra for configuration system:
* remove hydra
* use omegaconf to merge configurations
```
  91b44bc5
12 Mar, 2021 1 commit

CLI - Add command sb [version,deploy,exec,run] (#10) · 5d11579a

Yifan Xiong authored Mar 12, 2021

- Add CLI commands
  * sb version
  * sb deploy
  * sb exec
  * sb run
- Add interface with executor and runner
- Add cli test cases

5d11579a

11 Mar, 2021 1 commit

Benchmarks: Add Feature - Add random dataset for Pytorch. (#17) · ebea2d50

guoshzhao authored Mar 12, 2021



* add random dataset.

* install pytorch-cpu for test docker.

* fix typo

* add more test cases.

* address comments.
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>

ebea2d50

04 Feb, 2021 1 commit
- Setup: Add Test - Add Codecov (#9) · d32b96eb
  Yifan Xiong authored Feb 04, 2021
```
Add code coverage configuration.
```
  d32b96eb
01 Feb, 2021 2 commits

Docs - Initialize README (#6) · 3f19685f

Yifan Xiong authored Feb 01, 2021

Initialize README.md and update SUPPORT.md, update
* project description
* installation
* usage
* developer guide
* add dependencies version requirement

3f19685f

Setup: Revision - Update lint rules (#7) · a8977386

Yifan Xiong authored Feb 01, 2021

Update some lint rules, including:
* change max line length from 79 to 120, following [pytorch]
* add dedent_closing_brackets in yapf
* remove typed def requirements in mypy

Fix the return code bug in setup.py: when a lint/test command returns 1,
`os.system` returns 256, and `sys.exit(256)` results in return code 0.
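
The arithmetic behind this bug can be sketched as follows, assuming a POSIX system, where `os.system` returns a wait status with the exit code in the high byte and only the low 8 bits of an exit code reach the parent process. The helper names are hypothetical and are not taken from setup.py.

```python
import os


def exit_code_from_system(status):
    """Convert the wait status returned by os.system (POSIX) into an exit code."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 1  # terminated by a signal or otherwise abnormal: report failure


def truncated_exit_code(code):
    """Exit code the parent process observes on POSIX: only the low 8 bits survive."""
    return code & 0xFF


status = 256  # what os.system returns on POSIX when the command exits with code 1
seen_by_shell_if_unfixed = truncated_exit_code(status)
seen_by_shell_if_fixed = truncated_exit_code(exit_code_from_system(status))
```

Passing the raw status to `sys.exit` hides the failure from CI, whereas decoding it first preserves the non-zero exit code.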

[pytorch]: https://github.com/pytorch/pytorch/blob/d1dcd5f/.flake8#L3

a8977386

28 Jan, 2021 1 commit

Setup: Init - Initialize setup.py and basic configs (#4) · 5be32481

Yifan Xiong authored Jan 28, 2021

Initialize setup.py and basic configurations for this project.

Major revisions:

- initialize setup.py for Python package
- add gitignore and dockerignore
- add editorconfig for editors
- configure yapf for auto formatting
- configure mypy for type hints
- configure flake8 for lint, including quotes and docstrings
- add pre-commit check for `git commit`
- add spelling check in GitHub Actions
- format existing files according to configured rules

Example usage:

    # install dependencies
    $ python3 -m pip install -e .[dev,test]
    $ pre-commit install

    # format code automatically
    $ python3 setup.py format

    # lint code
    $ python3 setup.py lint

    # test code
    $ python3 setup.py test

5be32481