Commits · b5b1c3dac7831f3568fb6b574f6ade2c7dc5b575 · tsoc / superbenchmark

25 Apr, 2022 1 commit

Bug - Fix bug of duration feature for model benchmarks in distributed mode. (#347) · b5b1c3da

user4543 authored Apr 25, 2022

**Description**
Fix bug of duration feature for model benchmarks in distributed mode.

**Major Revision**
- Add all_reduce to sync the result of is_finished (the function that decides whether the model benchmark should stop) in each step
  - this avoids ranks disagreeing on when the duration ends (otherwise some rank may enter one more step and never finish)
- Add torch.cuda.synchronize() before and after step time measurement in train_step() for all model benchmarks
  - some operations in train_step() may be asynchronous, resulting in incorrect step time records (for example, lstm)
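
The rank inconsistency described in the first item can be illustrated with a small pure-Python simulation. This is only a sketch under assumptions: `all_reduce_max`, `local_is_finished`, and `simulate` are hypothetical names, the collective is simulated in-process rather than run through a real communication backend, and a MAX reduction is assumed (the commit message does not say which reduction is used).

```python
def all_reduce_max(flags):
    """Simulated all_reduce with MAX: every rank receives the largest flag."""
    largest = max(flags)
    return [largest] * len(flags)


def local_is_finished(elapsed, duration):
    """Hypothetical per-rank check: has this rank run for the full duration?"""
    return elapsed >= duration


def simulate(step_times, duration, sync):
    """Return how many steps each rank runs before it decides to stop.

    step_times[r] is the (assumed constant) time of one step on rank r.
    """
    world_size = len(step_times)
    elapsed = [0.0] * world_size
    steps = [0] * world_size
    active = [True] * world_size
    while any(active):
        flags = []
        for rank in range(world_size):
            if active[rank]:
                elapsed[rank] += step_times[rank]
                steps[rank] += 1
            flags.append(int(local_is_finished(elapsed[rank], duration)))
        if sync:
            # All ranks agree on a single decision for this step.
            flags = all_reduce_max(flags)
        for rank in range(world_size):
            if flags[rank]:
                active[rank] = False
    return steps


# Rank 1 is slightly slower per step, so it reaches the duration one step earlier.
unsynced = simulate([1.0, 1.25], duration=5.0, sync=False)
synced = simulate([1.0, 1.25], duration=5.0, sync=True)
print(unsynced)  # [5, 4]: rank 0 enters a step that rank 1 never joins
print(synced)    # [4, 4]: both ranks stop after the same step
```

In a real distributed run, the extra step on rank 0 would block in a collective operation that rank 1 never enters, which matches the "can never finish" symptom in the description.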

b5b1c3da

21 Apr, 2022 1 commit

Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342) · 6456bcad

user4543 authored Apr 21, 2022

**Description**
Fix bugs in sync results on root rank for e2e model benchmarks.

Bugs:
 - results were not replaced with the synced results (grammar)
 - synced results were applied only on the root rank, not on all ranks
 - results were output on local_rank 0 rather than on the global root rank

6456bcad

19 Apr, 2022 1 commit

Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344) · c770ed5d

user4543 authored Apr 19, 2022

**Description**
Support regex in annotations of benchmark naming for metrics in rules.
For example:
metrics:
 \- 'model-benchmarks:resnet50:float/.\*/fp16_train_throughput'
 ->
 \- 'model-benchmarks:.\*/.\*/fp16_train_throughput'
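
As a rough illustration of what such a pattern selects, the sketch below matches regex patterns against a list of metric names with Python's `re` module. The helper `select_metrics` and the metric names are hypothetical, and whole-string matching via `re.fullmatch` is an assumption; the analyzer's actual matching logic may differ.

```python
import re


def select_metrics(patterns, metric_names):
    """Return the metric names that fully match at least one regex pattern."""
    return [
        name for name in metric_names
        if any(re.fullmatch(pattern, name) for pattern in patterns)
    ]


# Hypothetical metric names, chosen only to exercise the pattern.
metric_names = [
    'model-benchmarks:resnet/pytorch-resnet50/fp16_train_throughput',
    'model-benchmarks:lstm/pytorch-lstm/fp16_train_throughput',
    'model-benchmarks:resnet/pytorch-resnet50/fp32_train_throughput',
]

patterns = ['model-benchmarks:.*/.*/fp16_train_throughput']
selected = select_metrics(patterns, metric_names)
print(selected)  # the two fp16_train_throughput metrics
```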

c770ed5d

18 Apr, 2022 1 commit

Bug - Support no matching rules and unify the output name in result_summary (#345) · 4fae2218

user4543 authored Apr 18, 2022

**Description**
Support no matching rules and unify the output name in result_summary

**Major Revision**
- Support rule with no matched metrics in result summary
- Unify output file name to 'results-summary'

4fae2218

16 Apr, 2022 1 commit

Bug - Force to fix ort version as '1.10.0' (#343) · 262697cb

user4543 authored Apr 16, 2022

**Description**
Force the ort version to be fixed at '1.10.0'.

262697cb

11 Apr, 2022 2 commits

Benchmarks: Add Benchmark - Add FAMBench based on docker benchmark (#338) · 80dcc8aa

guoshzhao authored Apr 11, 2022

**Description**
Integrate FAMBench into superbench based on docker implementation:
https://github.com/facebookresearch/FAMBench

The script to run all benchmarks is:
https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh

80dcc8aa

CLI - Integrate output all nodes diagnosis results (#339) · 8dc19ca4

user4543 authored Apr 11, 2022

**Description**
Integrate the output of diagnosis results for all nodes.

8dc19ca4

10 Apr, 2022 1 commit

Analyzer: Add Feature - Output results of all nodes in data diagnosis (#336) · 55b0f9d2

user4543 authored Apr 10, 2022

**Description**
Output results of all nodes in data diagnosis.

55b0f9d2

08 Apr, 2022 2 commits

Docs - Add usage for result summary (#337) · 56c9a711

user4543 authored Apr 09, 2022

**Description**
Add usage for result summary.

56c9a711

CLI - Integrate result summary and update output format of data diagnosis (#335) · f15da60b

user4543 authored Apr 08, 2022

**Description**
Integrate result summary and update output format of data diagnosis.

**Major Revision**
- integrate result summary
- add md and html format for data diagnosis

f15da60b

01 Apr, 2022 1 commit

Benchmarks: Add Feature - Provide option to save raw data into file. (#333) · 6d895da8

guoshzhao authored Apr 01, 2022

**Description**
Use the config `log_raw_data` to control whether the raw data is logged into a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw data into a file.
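
A minimal sketch of how such a switch might gate file output is shown below. Only the option name `log_raw_data` comes from the commit message; the helper `save_raw_data`, the file name, and the sample output line are hypothetical placeholders, not the project's actual implementation.

```python
import os
import tempfile


def save_raw_data(raw_output, log_raw_data, log_path):
    """Append raw benchmark output to log_path only when log_raw_data is enabled.

    Returns True if the data was written, False otherwise.
    """
    if not log_raw_data:
        return False
    with open(log_path, 'a', encoding='utf-8') as log_file:
        log_file.write(raw_output.rstrip('\n') + '\n')
    return True


# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, 'rawdata.log')  # hypothetical file name
    sample = 'size=8 busbw=0.1'  # placeholder raw output line
    wrote_default = save_raw_data(sample, False, log_path)  # default: not saved
    wrote_enabled = save_raw_data(sample, True, log_path)   # enabled: saved
    with open(log_path, encoding='utf-8') as log_file:
        saved = log_file.read()
```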

6d895da8

31 Mar, 2022 1 commit

Bump minimist from 1.2.5 to 1.2.6 in /website (#334) · d368d90e

dependabot[bot] authored Mar 31, 2022

Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6.
- [Release notes](https://github.com/substack/minimist/releases)
- [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6)

---
updated-dependencies:
- dependency-name: minimist
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

d368d90e

24 Mar, 2022 1 commit

Analyzer: Add feature - Add result summary in excel,md,html format (#320) · 84fed1ce

user4543 authored Mar 24, 2022

**Description**
Add result summary in excel, md, and html format.

**Major Revision**
- Add ResultSummary class to support result summary in excel, md, and html format.
- Abstract RuleBase class for commonly used functions in DataDiagnosis and ResultSummary.

84fed1ce

22 Mar, 2022 1 commit

Bug: Benchmarks - remove fp16 samples type converting time (#332) · c5aa4f4e

user4543 authored Mar 22, 2022

**Description**
Remove the fp16 sample type conversion time for cnn training and lstm inference.

c5aa4f4e

21 Mar, 2022 1 commit

Config - Add inference config for NC A100 and NV A10 series (#329) · a9634ef5

Yifan Xiong authored Mar 21, 2022

Add inference config for preview SKUs, including:
* [NC96ads_A100_v4](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series)
* [NV18ads_A10_v5](https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series)

a9634ef5

17 Mar, 2022 1 commit

Bug: Benchmarks - remove fp16 samples type converting time for cnn and lstm models (#330) · 6e749180

user4543 authored Mar 17, 2022

**Description**
Remove the fp16 sample type conversion time for cnn and lstm models.

6e749180

16 Mar, 2022 1 commit

Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324) · ff51a3ce

rafsalas19 authored Mar 16, 2022

**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn

ff51a3ce

15 Mar, 2022 2 commits

Bug: Executor - fix bug in result writing to files for mpi mode (#328) · 84359fd8

user4543 authored Mar 16, 2022

**Description**
Fix the bug in result writing to files for mpi mode.

84359fd8

Analyzer - Add md and html output format for DataDiagnosis (#325) · b3c95f18

user4543 authored Mar 15, 2022

**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis

b3c95f18

09 Mar, 2022 1 commit

Bug - Fix env path to absolute path (#327) · f755c0b6

Yifan Xiong authored Mar 09, 2022

Fix the env file path to an absolute path in `docker exec`, in case there are mixed ssh and local connections or different users are used.

f755c0b6

07 Mar, 2022 2 commits

Analyzer: Revise - Abstract RuleBase from DataDiagnosis (#321) · 1ec055e1

user4543 authored Mar 07, 2022

**Description**
Abstract RuleBase from DataDiagnosis.

1ec055e1

Bump url-parse from 1.5.8 to 1.5.10 in /website (#323) · 97595271

dependabot[bot] authored Mar 07, 2022

Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.8 to 1.5.10.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.8...1.5.10)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

97595271

06 Mar, 2022 1 commit

Benchmarks - Keep BatchNorm as fp32 for pytorch cnn models cast to fp16 (#322) · a9ef0f99

Jeff Daily authored Mar 06, 2022

**Description**
The BatchNorm operator is not numerically stable in fp16. PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models. Refer to https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 for superbench more accurately reflects real workloads.
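
The commit message does not spell out the failure mode. As a rough, framework-free illustration of how limited fp16 precision can distort the kind of running sums that normalization statistics rely on, the sketch below rounds every intermediate result to half precision using the standard library's `struct` format `'e'`. The helpers `to_fp16` and `running_sum_fp16` are hypothetical, and this is not how PyTorch computes BatchNorm internally.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def running_sum_fp16(values):
    """Accumulate values while rounding every partial sum to fp16."""
    total = 0.0
    for value in values:
        total = to_fp16(total + to_fp16(value))
    return total


values = [1.0] * 4096
exact_sum = sum(values)               # 4096.0 in double precision
fp16_sum = running_sum_fp16(values)   # stalls once increments fall below fp16 spacing

print(exact_sum / len(values))  # 1.0
print(fp16_sum / len(values))   # 0.5
```

Once the partial sum reaches 2048, adjacent fp16 values are 2 apart, so adding 1.0 no longer changes it and the computed mean is off by a factor of two.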

a9ef0f99

28 Feb, 2022 2 commits

Dockerfile - Add dockerfile for rocm5.0.1 (#319) · 425b9ff8

user4543 authored Feb 28, 2022

**Description**
Add dockerfile for rocm5.0.1.

425b9ff8

Bump prismjs from 1.23.0 to 1.27.0 in /website (#318) · 74a3b123

dependabot[bot] authored Feb 28, 2022

Bumps [prismjs](https://github.com/PrismJS/prism) from 1.23.0 to 1.27.0.
- [Release notes](https://github.com/PrismJS/prism/releases)
- [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PrismJS/prism/compare/v1.23.0...v1.27.0)

---
updated-dependencies:
- dependency-name: prismjs
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

74a3b123

25 Feb, 2022 1 commit

Dockerfile - Add rocm5.0 dockerfile (#307) · a4950a70

user4543 authored Feb 26, 2022

**Description**
Add rocm5.0 dockerfile.

a4950a70

24 Feb, 2022 2 commits

Bug Fix - Fix P2P detection in gpu_copy (#317) · 01304706

Ziyue Yang authored Feb 25, 2022

**Description**
Fix invalid reference of P2P detection result in gpu_copy.

01304706

Benchmarks: Build Pipeline - Make gpcnet only for cuda (#316) · 4f5027db

user4543 authored Feb 24, 2022

**Description**
Make gpcnet only for cuda.

4f5027db

22 Feb, 2022 1 commit

Bug - Fix empty HIP_ARCHITECTURES issue in cmake>=3.21.0 (#315) · e0c49142

user4543 authored Feb 22, 2022

**Description**
Fix the issue that HIP_ARCHITECTURES is empty with cmake>=3.21.0.
Refer to https://github.com/ROCm-Developer-Tools/HIP/pull/2364

e0c49142

21 Feb, 2022 1 commit

Bump url-parse from 1.5.1 to 1.5.8 in /website (#313) · 0740780b

dependabot[bot] authored Feb 21, 2022

Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.1 to 1.5.8.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.1...1.5.8)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

0740780b

20 Feb, 2022 2 commits

Config - Add T4 configurations for inference (#311) · ea2c10ab

Yifan Xiong authored Feb 20, 2022

Add T4 configurations for inference.

ea2c10ab

Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289) · 97ed12f9

user4543 authored Feb 20, 2022

**Description**
Add multi-rules feature for data diagnosis to support a combined check across multiple rules.

**Major Revision**
- revise rule design to support combined checks across multiple rules
- update related code and tests
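
The idea of a combined check can be sketched as follows: each rule records how many of its metrics violate a criterion, and a separate function decides from those per-rule counts whether the node is flagged. Everything here is a hypothetical illustration (the names `combined_check`, `rule_bw`, `rule_flops`, the metric names, baselines, and thresholds are invented), not the actual rule schema of the analyzer.

```python
def combined_check(node_metrics, rules, combine):
    """Count violations per rule, then apply a combining function to the counts.

    Returns (is_defective, per_rule_violation_counts).
    """
    label = {}
    for rule_name, rule in rules.items():
        label[rule_name] = sum(
            1
            for metric, baseline in rule['baselines'].items()
            if metric in node_metrics
            and rule['criteria'](node_metrics[metric] / baseline)
        )
    return combine(label), label


# Hypothetical rules: a metric violates its rule when it is below 95% of baseline.
rules = {
    'rule_bw': {'baselines': {'nccl_bw': 200.0}, 'criteria': lambda ratio: ratio < 0.95},
    'rule_flops': {'baselines': {'gemm_flops': 100.0}, 'criteria': lambda ratio: ratio < 0.95},
}


def combine(label):
    """Flag the node only if both rules are violated."""
    return label['rule_bw'] + label['rule_flops'] >= 2


bad_node = combined_check({'nccl_bw': 180.0, 'gemm_flops': 90.0}, rules, combine)
ok_node = combined_check({'nccl_bw': 199.0, 'gemm_flops': 90.0}, rules, combine)
print(bad_node)  # (True, {'rule_bw': 1, 'rule_flops': 1})
print(ok_node)   # (False, {'rule_bw': 0, 'rule_flops': 1})
```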

97ed12f9

15 Feb, 2022 2 commits

Bug - Fix env file path (#310) · 1f48268b

Yifan Xiong authored Feb 15, 2022

Fix env file path for `docker run`.

1f48268b

Bump follow-redirects from 1.14.7 to 1.14.8 in /website (#309) · 53fe0c47

dependabot[bot] authored Feb 15, 2022

Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.7 to 1.14.8.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.7...v1.14.8)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

53fe0c47

10 Feb, 2022 1 commit

Benchmarks: Revise Code - Add support for pytorch>=1.9.0 of init_process_group (#305) · e31b8c9e

user4543 authored Feb 10, 2022

**Description**
Add support for pytorch>=1.9.0 of init_process_group.

**Major Revision**
- Use PrefixStore(TCPStore) to call init_process_group manually for each model run

e31b8c9e

09 Feb, 2022 2 commits

Benchmarks: Build Pipeline - Update rccl-tests submodule to fix divide by zero error (#306) · 4abda6f5

user4543 authored Feb 09, 2022

**Description**
Update rccl-tests submodule to fix divide by zero error.

4abda6f5

Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302) · 6cdf7595

Ziyue Yang authored Feb 09, 2022

**Description**
This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect their performance, and revises benchmark metrics accordingly.

6cdf7595

08 Feb, 2022 2 commits

Benchmarks: Add Feature - Add GDR-only nccl-tests for Nvidia machines (#299) · 433785fd

Ziyue Yang authored Feb 08, 2022

This commit adds GDR-only nccl-tests for Nvidia machines. It also bumps NCCL to v2.10.3-1 to achieve peak performance in this test.

433785fd

Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301) · 682b2c12

Ziyue Yang authored Feb 08, 2022

This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.

682b2c12

07 Feb, 2022 1 commit

Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298) · 85389055

Ziyue Yang authored Feb 07, 2022

**Description**
This commit does the following to reduce result variance in gpu_copy benchmark:
1) Add a warmup phase for gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional, since enabling it is not recommended in performance tests;
4) Enlarge message size in performance benchmark.

85389055