Commits · 682b2c120dd3ebfcac3be72f9f9225c53abe5bbc · tsoc / superbenchmark

08 Feb, 2022 1 commit
- Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301) · 682b2c12
  Ziyue Yang authored Feb 08, 2022
```
This commit makes data checking in gpu_copy optional, because checking takes too long when the message size is large.
```
  682b2c12
07 Feb, 2022 1 commit

Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298) · 85389055

Ziyue Yang authored Feb 07, 2022

**Description**
This commit does the following to reduce result variance in the gpu_copy benchmark:
1) Add a warmup phase to the gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional; enabling it is not recommended in performance tests;
4) Enlarge the message size in the performance benchmark.

85389055
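
The warmup step in the commit above can be illustrated with a minimal sketch. It is not the gpu_copy code: `time_with_warmup` and its parameters are hypothetical, and it uses the CPU clock `time.perf_counter` as a stand-in, whereas the commit times with CUDA events.

```python
import time


def time_with_warmup(fn, warmup_iters=3, timed_iters=10):
    """Return mean wall-clock seconds per call of fn, excluding warmup calls."""
    # Untimed warmup calls absorb one-time costs (lazy initialization, caches).
    for _ in range(warmup_iters):
        fn()
    start = time.perf_counter()
    for _ in range(timed_iters):
        fn()
    return (time.perf_counter() - start) / timed_iters


# Hypothetical workload: allocate a 64 KiB buffer on each call.
calls = []
mean_seconds = time_with_warmup(lambda: calls.append(bytes(1 << 16)),
                                warmup_iters=2, timed_iters=5)
```

Only the timed iterations contribute to the reported mean, so first-call overhead does not skew the result.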

30 Jan, 2022 1 commit
- Bug - Fix typo in document (#297) · 28195be6
  Yuting Jiang authored Jan 30, 2022
```
Fix typo in document.
```
  28195be6
29 Jan, 2022 3 commits
- Benchmarks - Support T4 and A10 in GEMM benchmark (#294) · 3419447c
  Yifan Xiong authored Jan 29, 2022
```
Support T4 and A10 in GEMM benchmark.
```
  3419447c
- Config - Support customized env for all modes (#295) · 3524975c
  Yifan Xiong authored Jan 29, 2022
```
Support customized env for all modes in configuration.
```
  3524975c
- Benchmarks: Fix Bug - Fix GPU scan logic in gpu_copy (#296) · f3d05006
  Ziyue Yang authored Jan 29, 2022
```
Fix bug of GPU scan logic in bidirectional tests.
```
  f3d05006
28 Jan, 2022 2 commits

Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287) · d03d110f

guoshzhao authored Jan 28, 2022

**Description**
Sync the E2E training results among all workers for each step.

**Major Revision**
- Sync (do allreduce max) the E2E training results among all workers.
- Avoid using ':0' in the metric name if only one rank has output.

d03d110f
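
As a rough illustration of the "allreduce max" idea in the commit above, the sketch below simulates the reduction in plain Python, with no distributed backend. The function name and the sample numbers are made up for the example.

```python
def allreduce_max(per_worker_results):
    """Element-wise max across workers, one value per training step."""
    return [max(step_values) for step_values in zip(*per_worker_results)]


# Hypothetical per-step times in milliseconds, one inner list per worker.
step_times_ms = [
    [10.0, 12.0, 11.0],  # worker 0
    [11.5, 11.0, 13.0],  # worker 1
]
synced = allreduce_max(step_times_ms)
```

After the reduction every worker would report the same per-step value, namely the slowest worker's result for that step.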

Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288) · d877ca23

guoshzhao authored Jan 28, 2022

**Description**
Add timeout feature for each benchmark.

**Major Revision**
- Add a `timeout` config for each benchmark. The current config files set the timeout only for kernel-launch, as an example; timeouts for other benchmarks can be set in the future.
- Set the timeout config for `ansible_runner.run()`. On timeout, the runner gets return code 254:
   [ansible.py:80][WARNING] Run failed, return code 254.
- Use the `timeout` command to terminate the client process.

d877ca23
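
The commit above relies on the `timeout` command and on `ansible_runner`. As a loosely related sketch, the same idea of bounding a process's run time can be shown with Python's standard `subprocess` module. `run_with_timeout` is a hypothetical helper, not part of SuperBench.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed on timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return None


# Two stand-in "benchmarks": one finishes at once, one sleeps too long.
fast = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
fast_code = run_with_timeout(fast, timeout_s=30)
slow_code = run_with_timeout(slow, timeout_s=0.5)
```

Here a timeout is signaled with `None` rather than a special return code; a real runner would map it to whatever code its callers expect.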

27 Jan, 2022 1 commit
- Config - Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml (#292) · f283b536
  Yuting Jiang authored Jan 28, 2022
```
**Description**
Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml
```
  f283b536
25 Jan, 2022 1 commit

Config - Update benchmark naming to support annotations (#284) · 7d7cd3dc

Yifan Xiong authored Jan 25, 2022

__Description__

Update benchmark naming to support annotations.

__Major Revisions__
- Update name for `create_benchmark_context` in executor.
- Backward compatibility for model benchmarks using "_models" suffix.
- Update documents.

7d7cd3dc

24 Jan, 2022 2 commits

Bug: Fix code insecure issue that binds a socket to all network interfaces (#291) · 35fc06eb
Yuting Jiang authored Jan 24, 2022
```
**Description**
Fix code insecure issue that binds a socket to all network interfaces.
```
35fc06eb
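
The commit message above does not say which address the fix binds to. As a generic sketch of the issue, assuming a service that only needs local access, a listener can bind to the loopback address instead of `0.0.0.0`. `open_listener` is a hypothetical helper.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Open a TCP listener on a specific interface (loopback by default)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Binding to "0.0.0.0" or "" would accept connections on every interface.
    sock.bind((host, port))
    sock.listen(1)
    return sock


listener = open_listener()  # port 0 lets the OS pick a free port
bound_host, bound_port = listener.getsockname()
listener.close()
```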

Bug: Fix code insecure issue of integer overflow in cublas function (#290) · 380ce400

Yuting Jiang authored Jan 24, 2022

**Description**
Fix the insecure issue of a multiplication result being converted to a larger type.

**Major Revision**
- Cast an operand to `long long` so that the multiplication is done in `long long` and does not overflow.

380ce400
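
The class of bug fixed above can be sketched in Python. Python integers do not overflow, so the sketch emulates fixed-width arithmetic with `ctypes`, which truncates without overflow checking. The helper names and operand values are illustrative and are not taken from the cublas code.

```python
import ctypes


def mul_as_int32(a, b):
    """Emulate a product truncated to 32 bits (two's-complement wraparound)."""
    return ctypes.c_int32(a * b).value


def mul_as_int64(a, b):
    """Emulate the same product when it is computed in a 64-bit type."""
    return ctypes.c_int64(a * b).value


rows, cols = 65536, 65536  # hypothetical matrix dimensions
narrow = mul_as_int32(rows, cols)
wide = mul_as_int64(rows, cols)
```

Widening the result after a 32-bit multiplication is too late, because the high bits are already lost; widening an operand first keeps the full product.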

23 Jan, 2022 1 commit

Bump nanoid from 3.1.23 to 3.2.0 in /website (#286) · 5f6ad0cd

dependabot[bot] authored Jan 23, 2022

Bumps [nanoid](https://github.com/ai/nanoid) from 3.1.23 to 3.2.0.
- [Release notes](https://github.com/ai/nanoid/releases)
- [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ai/nanoid/compare/3.1.23...3.2.0)

---
updated-dependencies:
- dependency-name: nanoid
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

5f6ad0cd

21 Jan, 2022 1 commit

Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285) · 74421ffe

Ziyue Yang authored Jan 21, 2022

**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.

74421ffe

19 Jan, 2022 1 commit
- Benchmarks: Add Feature - Add percentile metrics for ort and pytorch inference benchmarks (#283) · fd2bc9e0
  guoshzhao authored Jan 19, 2022
```
**Description**
Add 50th, 90th, 95th, 99th, 99.9th latency metrics for ORT and pytorch inference benchmarks.
```
  fd2bc9e0
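
As an illustration of the percentile latency metrics mentioned above, here is a minimal sketch that uses the nearest-rank method. The commit does not state which interpolation method the benchmarks use, so the method, the function name, and the sample data are all assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies: 1.0 ms through 1000.0 ms.
latencies_ms = [float(i) for i in range(1, 1001)]
metrics = {p: percentile(latencies_ms, p) for p in (50, 90, 95, 99, 99.9)}
```

Tail percentiles such as the 99.9th only become meaningful with enough samples; with fewer than a thousand measurements it collapses to the maximum under this method.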
18 Jan, 2022 1 commit

CLI - Add command sb benchmark [list,list-parameters] (#279) · f7ffc545

Yifan Xiong authored Jan 18, 2022

__Description__

Add command `sb benchmark list` and `sb benchmark list-parameters` to support listing all optional parameters for benchmarks.

<details>
<summary>Examples</summary>
<pre>
$ sb benchmark list -n [a-z]+-bw -o table
Result
--------
mem-bw
nccl-bw
rccl-bw
</pre>
<pre>
$ sb benchmark list-parameters -n mem-bw
=== mem-bw ===
optional arguments:
  --bin_dir str         Specify the directory of the benchmark binary.
  --duration int        The elapsed time of benchmark in seconds.
  --mem_type str [str ...]
                        Memory types to benchmark. E.g. htod dtoh dtod.
  --memory str          Memory argument for bandwidthtest. E.g. pinned unpinned.
  --run_count int       The run count of benchmark.
  --shmoo_mode          Enable shmoo mode for bandwidthtest.
default values:
{'bin_dir': None,
 'duration': 0,
 'mem_type': ['htod', 'dtoh'],
 'memory': 'pinned',
 'run_count': 1}
</pre>
</details>

__Major Revisions__
* Add `sb benchmark list` to list benchmarks matching given name.
* Add `sb benchmark list-parameters` to list parameters for benchmarks which match given name.

__Minor Revisions__
* Sort format help text for argparse.

f7ffc545
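
The `-n [a-z]+-bw` example in the commit above suggests that benchmark names are selected by a regular expression. The sketch below reproduces that selection with `re.fullmatch`; whether the CLI uses a full match or another matching mode is an assumption, and the list of registered names is a made-up sample built from names that appear elsewhere in this log.

```python
import re


def list_benchmarks(pattern, registered):
    """Return the sorted benchmark names that fully match the pattern."""
    regex = re.compile(pattern)
    return sorted(name for name in registered if regex.fullmatch(name))


# Sample registry, not the real one.
registered = ["nccl-bw", "kernel-launch", "mem-bw", "ort-inference",
              "rccl-bw", "cpu-memory-bw-latency"]
matched = list_benchmarks(r"[a-z]+-bw", registered)
```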

17 Jan, 2022 2 commits

Bump follow-redirects from 1.14.1 to 1.14.7 in /website (#282) · 9a909d2b

dependabot[bot] authored Jan 17, 2022

Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.1 to 1.14.7.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.1...v1.14.7)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

9a909d2b

Bump shelljs from 0.8.4 to 0.8.5 in /website (#281) · 2538a7ee

dependabot[bot] authored Jan 17, 2022

Bumps [shelljs](https://github.com/shelljs/shelljs) from 0.8.4 to 0.8.5.
- [Release notes](https://github.com/shelljs/shelljs/releases)
- [Changelog](https://github.com/shelljs/shelljs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/shelljs/shelljs/compare/v0.8.4...v0.8.5)

---
updated-dependencies:
- dependency-name: shelljs
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

2538a7ee

30 Dec, 2021 1 commit

Release - SuperBench v0.4.0 (#278) · ff563b66

Yifan Xiong authored Dec 30, 2021



__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

ff563b66

14 Dec, 2021 1 commit
- Docs - Add usage for data diagnosis (#266) · 682ed06a
  Yuting Jiang authored Dec 14, 2021
```
**Description**
Add usage for data diagnosis.
```
  682ed06a
13 Dec, 2021 6 commits
- Docs - Update docs for monitor. (#265) · 2e10fb0d
  guoshzhao authored Dec 13, 2021
```
**Description**
Update docs for monitor.
```
  2e10fb0d
- Benchmarks - Add transformers for TensorRT inference (#254) · cb8a3cfb
  Yifan Xiong authored Dec 13, 2021
```
Add transformers for TensorRT inference.
```
  cb8a3cfb
- Docs - Add benchmark metrics for cpu-memory-bw-latency (#264) · 10012a0a
  Ziyue Yang authored Dec 13, 2021
```
**Description**
Add benchmark metrics for cpu-memory-bw-latency.
```
  10012a0a
- Benchmarks: Fix Comment - Correct benchmark name in test_gpu_copy_bw_performance.py #263 · b6781968
  Ziyue Yang authored Dec 13, 2021
```
**Description**
Benchmarks: Fix Comment - Correct benchmark name in test_gpu_copy_bw_performance.py.
```
  b6781968
- Benchmarks: Add Benchmark - Add mlc benchmark to superbench (#216) · b590409e
  Hossein Pourreza authored Dec 12, 2021
```
**Description**
Add mlc memory bandwidth and latency micro benchmark to Superbench.

**Major Revision**
- Add mlc benchmark with test and example files
```
  b590409e
- Docs - Add a small note for using release container version (#262) · c403b1ca
  yangpanMS authored Dec 12, 2021
```
**Description**
Minor doc change to highlight that the sb CLI version is independent of the sb container version.
```
  c403b1ca
10 Dec, 2021 5 commits

Benchmarks: Add Benchmark - Add ONNXRuntime inference benchmark based on ORT python API (#245) · 4d85630a

guoshzhao authored Dec 10, 2021

**Description**
Add ONNXRuntime inference benchmark based on ORT python API.

**Major Revision**
- Add `ORTInferenceBenchmark` class to export pytorch model to onnx model and do inference
- Add tests and example for `ort-inference` benchmark
- Update the introduction docs.

4d85630a

Analyzer: Add Feature - Add basic analysis features (#248) · c2f942cb

Yuting Jiang authored Dec 10, 2021

**Description**
Add basic analysis features.

**Major Revision**
- Add statistics, correlations of the raw data
- Add numeric outlier detection(inter_quartile_range)
- Add boxplot for selected metric

c2f942cb
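
The interquartile-range outlier detection named in the commit above can be sketched with the standard library. This is a generic IQR rule, not the analyzer's implementation; the function name, the conventional factor `k=1.5`, and the sample data are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical metric values collected from several nodes.
samples = [10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.8, 25.0]
outliers = iqr_outliers(samples)
```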

Monitor: Integration - Integrate monitor into Superbench (#259) · 6e357fb9

guoshzhao authored Dec 10, 2021

**Description**
Integrate monitor into Superbench.

**Major Revision**
- Initialize, start and stop monitor in SB executor.
- Parse the monitor data in SB runner and merge into benchmark results.
- Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
- Add monitor configs into config file.

6e357fb9
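
The commit above mentions a `ReduceType` with MAX, MIN and LAST for monitor metrics. One way such a reduction could look is sketched below. Only the three member names come from the commit; the enum values and the `reduce_samples` helper are hypothetical.

```python
from enum import Enum


class ReduceType(Enum):
    """How to collapse a series of monitor samples into one value."""
    MAX = "max"
    MIN = "min"
    LAST = "last"


def reduce_samples(samples, reduce_type):
    """Apply the chosen reduction to a non-empty list of samples."""
    if reduce_type is ReduceType.MAX:
        return max(samples)
    if reduce_type is ReduceType.MIN:
        return min(samples)
    return samples[-1]  # ReduceType.LAST


# Hypothetical GPU temperature samples taken during one benchmark.
gpu_temps = [61, 70, 68, 66]
summary = {rt.name: reduce_samples(gpu_temps, rt) for rt in ReduceType}
```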

Benchmarks: Fix Bug - Set reduce_op type for metric return_code (#261) · afea9913
guoshzhao authored Dec 10, 2021
```
**Description**
Set the `reduce_op` type for metric `return_code` as `None`.
```
afea9913
CLI - Integrate data diagnosis (#260) · ed2f3c3c
Yuting Jiang authored Dec 10, 2021
```
**Description**
Add a CLI command to integrate the data diagnosis module.
```
ed2f3c3c

09 Dec, 2021 1 commit
- Benchmarks: Unify metric names of benchmarks (#252) · 9f56b219
  Yuting Jiang authored Dec 09, 2021
```
**Description**
Unify metric names of benchmarks.
```
  9f56b219
08 Dec, 2021 2 commits

Analyzer: Initialization - Add baseline-based data diagnosis module (#242) · c13ed2a2

Yuting Jiang authored Dec 08, 2021

**Description**
Add data diagnosis module.

**Major Revision**
- Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes
- Add RuleOp class to define rule operators

c13ed2a2
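
To make the idea of baseline-based, rule-driven diagnosis concrete, here is a small sketch. It does not reflect the actual `DataDiagnosis` or `RuleOp` interfaces: the rule operator, the metric name, the tolerance format, and the data are all hypothetical.

```python
def below_baseline(value, baseline, tolerance):
    """Hypothetical rule operator: value is more than `tolerance` (a fraction) below baseline."""
    return value < baseline * (1 - tolerance)


def diagnose(node_results, baseline, rules):
    """Return {node: [violated metrics]} for nodes that break at least one rule."""
    report = {}
    for node, metrics in node_results.items():
        violated = [
            metric for metric, tolerance in rules.items()
            if metric in metrics
            and below_baseline(metrics[metric], baseline[metric], tolerance)
        ]
        if violated:
            report[node] = violated
    return report


baseline = {"mem-bw/htod": 25.0}   # expected value per metric
rules = {"mem-bw/htod": 0.05}      # allow up to 5% below baseline
node_results = {
    "node0": {"mem-bw/htod": 24.8},
    "node1": {"mem-bw/htod": 20.0},
}
defective = diagnose(node_results, baseline, rules)
```

Keeping the comparison in a separate operator function means new rule types can be added without touching the loop over nodes.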

Bug - Fix issues for distributed runs (#258) · 213ab14b

Yifan Xiong authored Dec 08, 2021

Fix issues for distributed runs:
* fix config for memory bandwidth benchmarks
* add throttling for high concurrency docker pull
* update rsync path and exclude directories
* handle exceptions when creating summary
* tune for logging

213ab14b

07 Dec, 2021 1 commit
- Benchmarks: Add Feature - Add return_code metric into result (#256) · 44f0270e
  guoshzhao authored Dec 07, 2021
```
**Description**
Add return_code metric into result and revise unit tests.
```
  44f0270e
06 Dec, 2021 1 commit

Docs - Add doc for data diagnosis (#249) · 655f238d

Yuting Jiang authored Dec 06, 2021

**Description**
Add doc for data diagnosis, including input, output and baseline file schema.

655f238d

03 Dec, 2021 1 commit
- Benchmarks - Add config file for NDm A100 v4 (#255) · bd8f105d
  Yifan Xiong authored Dec 04, 2021
```
Add config file for Azure NDm A100 v4 SKU.
```
  bd8f105d
02 Dec, 2021 3 commits
- Benchmarks: Configuration - Add gpt-small into config files. (#253) · 8042fa34
  guoshzhao authored Dec 02, 2021
```
**Description**
Add gpt-small into config files.
```
  8042fa34
- Benchmarks: Add Feature - Add 'ignore_invalid' option when registering benchmarks. (#247) · 371fd61c
  guoshzhao authored Dec 02, 2021
```
**Description**
If `ignore_invalid` is True and 'required' arguments are not set when registering the benchmark, argument checking is skipped and the user should provide those arguments in the config.
```
  371fd61c
- Benchmark: Replace `-c` argument with `-N` for `numactl` in Configuration (#250) · b4ea97bf
  Yifan Xiong authored Dec 02, 2021
```
**Description**
Replace `-c` argument with `-N` for `numactl` since the old `-c`/`--cpubind` argument is deprecated.
```
  b4ea97bf