Commits · v0.5.0-rc1 · tsoc / superbenchmark

24 Mar, 2022 1 commit

Analyzer: Add feature - Add result summary in excel,md,html format (#320) · 84fed1ce

user4543 authored Mar 24, 2022

**Description**
Add result summary in excel, md, and html formats.

**Major Revision**
- Add ResultSummary class to support result summary in excel, md, and html formats.
- Abstract RuleBase class for commonly used functions in DataDiagnosis and ResultSummary.

84fed1ce

22 Mar, 2022 1 commit
- Bug: Benchmarks - remove fp16 samples type converting time (#332) · c5aa4f4e
  user4543 authored Mar 22, 2022
```
**Description**
Remove fp16 samples type converting time for training cnn and lstm inference.
```
  c5aa4f4e
21 Mar, 2022 1 commit

Config - Add inference config for NC A100 and NV A10 series (#329) · a9634ef5

Yifan Xiong authored Mar 21, 2022

Add inference config for preview SKUs, including:
* [NC96ads_A100_v4](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series)
* [NV18ads_A10_v5](https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series)

a9634ef5

17 Mar, 2022 1 commit
- Bug: Benchmarks - remove fp16 samples type converting time for cnn and lstm models (#330) · 6e749180
  user4543 authored Mar 17, 2022
```
**Description**
Remove fp16 samples type converting time for cnn and lstm models.
```
  6e749180
16 Mar, 2022 1 commit

Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324) · ff51a3ce

rafsalas19 authored Mar 16, 2022

**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn

ff51a3ce

15 Mar, 2022 2 commits

Bug: Executor - fix bug in result writing to files for mpi mode (#328) · 84359fd8
user4543 authored Mar 16, 2022
```
**Description**
Fix the bug in writing results to files in mpi mode.
```
84359fd8

Analyzer - Add md and html output format for DataDiagnosis (#325) · b3c95f18

user4543 authored Mar 15, 2022

**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis

b3c95f18
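
The md and html output described above amounts to rendering the same rows of results in two table syntaxes. A minimal sketch, assuming hypothetical helper names (`to_markdown`, `to_html`) and hypothetical column names; this is not the actual `file_handler` or DataDiagnosis interface:

```python
from html import escape


def to_markdown(header, rows):
    """Render a header and rows as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def to_html(header, rows):
    """Render the same data as a bare HTML table, escaping cell text."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Hypothetical diagnosis rows, for illustration only.
header = ["machine", "category", "defective_details"]
rows = [["node-0", "KernelLaunch", "kernel-launch/event_time"]]

print(to_markdown(header, rows))
print(to_html(header, rows))
```

Keeping both renderers on the same `header`/`rows` input means a new output format only needs one more function.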

09 Mar, 2022 1 commit

Bug - Fix env path to absolute path (#327) · f755c0b6

Yifan Xiong authored Mar 09, 2022

Change the env file path to an absolute path in `docker exec`, in case ssh and local connections are mixed or different users are used.

f755c0b6

07 Mar, 2022 2 commits

Analyzer: Revise - Abstract RuleBase from DataDiagnosis (#321) · 1ec055e1
user4543 authored Mar 07, 2022
```
**Description**
Abstract RuleBase from DataDiagnosis.
```
1ec055e1

Bump url-parse from 1.5.8 to 1.5.10 in /website (#323) · 97595271

dependabot[bot] authored Mar 07, 2022

Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.8 to 1.5.10.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.8...1.5.10)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

97595271

06 Mar, 2022 1 commit

Benchmarks - Keep BatchNorm as fp32 for pytorch cnn models cast to fp16 (#322) · a9ef0f99

Jeff Daily authored Mar 06, 2022

**Description**
The BatchNorm operator is not numerically stable in fp16. PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models. Refer to https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 for superbench more accurately reflects real workloads.

a9ef0f99

28 Feb, 2022 2 commits

Dockerfile - Add dockerfile for rocm5.0.1 (#319) · 425b9ff8
user4543 authored Feb 28, 2022
```
**Description**
Add dockerfile for rocm5.0.1.
```
425b9ff8

Bump prismjs from 1.23.0 to 1.27.0 in /website (#318) · 74a3b123

dependabot[bot] authored Feb 28, 2022

Bumps [prismjs](https://github.com/PrismJS/prism) from 1.23.0 to 1.27.0.
- [Release notes](https://github.com/PrismJS/prism/releases)
- [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md)
- [Commits](https://github.com/PrismJS/prism/compare/v1.23.0...v1.27.0)

---
updated-dependencies:
- dependency-name: prismjs
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

74a3b123

25 Feb, 2022 1 commit
- Dockerfile - Add rocm5.0 dockerfile (#307) · a4950a70
  user4543 authored Feb 26, 2022
```
**Description**
Add rocm5.0 dockerfile.
```
  a4950a70
24 Feb, 2022 2 commits
- Bug Fix - Fix P2P detection in gpu_copy (#317) · 01304706
  Ziyue Yang authored Feb 25, 2022
```
**Description**
Fix invalid reference of P2P detection result in gpu_copy.
```
  01304706
- Benchmarks: Build Pipeline - Make gpcnet only for cuda (#316) · 4f5027db
  user4543 authored Feb 24, 2022
```
**Description**
Make gpcnet only for cuda.
```
  4f5027db
22 Feb, 2022 1 commit

Bug - Fix empty HIP_ARCHITECTURES issue in cmake>=3.21.0 (#315) · e0c49142

user4543 authored Feb 22, 2022

**Description**
Fix the empty HIP_ARCHITECTURES issue with cmake>=3.21.0.
Refer to https://github.com/ROCm-Developer-Tools/HIP/pull/2364

e0c49142

21 Feb, 2022 1 commit

Bump url-parse from 1.5.1 to 1.5.8 in /website (#313) · 0740780b

dependabot[bot] authored Feb 21, 2022

Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.1 to 1.5.8.
- [Release notes](https://github.com/unshiftio/url-parse/releases)
- [Commits](https://github.com/unshiftio/url-parse/compare/1.5.1...1.5.8)

---
updated-dependencies:
- dependency-name: url-parse
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

0740780b

20 Feb, 2022 2 commits

Config - Add T4 configurations for inference (#311) · ea2c10ab
Yifan Xiong authored Feb 20, 2022
```
Add T4 configurations for inference.
```
ea2c10ab

Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289) · 97ed12f9

user4543 authored Feb 20, 2022

**Description**
Add multi-rules feature for data diagnosis to support multiple rules' combined check.

**Major Revision**
- revise rule design to support combined checks across multiple rules
- update related code and tests

97ed12f9
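
A combined check across multiple rules can be pictured as evaluating each rule on its own and then applying a second-level condition to the per-rule results. The sketch below is an assumption-heavy illustration: the rule names, metric names, baselines, and the `diagnose` helper are all hypothetical and do not reflect the actual rule file format.

```python
def violates(value, baseline, max_drop):
    """Return True when value falls more than max_drop (a fraction) below baseline."""
    return (baseline - value) / baseline > max_drop


# Hypothetical single rules: each maps a node's metrics to a pass/fail label.
SINGLE_RULES = {
    "gemm_drop": lambda m: violates(m["gemm_flops"], 100.0, 0.05),
    "bw_drop": lambda m: violates(m["mem_bw"], 200.0, 0.05),
}


def diagnose(metrics, combine=all):
    """Evaluate every single rule, then combine the labels into one verdict."""
    labels = {name: rule(metrics) for name, rule in SINGLE_RULES.items()}
    return labels, combine(labels.values())


node_metrics = {"gemm_flops": 90.0, "mem_bw": 199.0}  # hypothetical results
labels, defective = diagnose(node_metrics)
print(labels, defective)
```

Passing `combine=any` instead of the default `all` flags a node when at least one rule is violated rather than requiring every rule to fail.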

15 Feb, 2022 2 commits

Bug - Fix env file path (#310) · 1f48268b
Yifan Xiong authored Feb 15, 2022
```
Fix env file path for `docker run`.
```
1f48268b

Bump follow-redirects from 1.14.7 to 1.14.8 in /website (#309) · 53fe0c47

dependabot[bot] authored Feb 15, 2022

Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.7 to 1.14.8.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.7...v1.14.8)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

53fe0c47

10 Feb, 2022 1 commit

Benchmarks: Revise Code - Add support for pytorch>=1.9.0 of init_process_group (#305) · e31b8c9e

user4543 authored Feb 10, 2022

**Description**
Add support for pytorch>=1.9.0 of init_process_group.

**Major Revision**
- Use PrefixStore(TCPStore) to init_process_group manually for each model run

e31b8c9e

09 Feb, 2022 2 commits
- Benchmarks: Build Pipeline - Update rccl-tests submodule to fix divide by zero error (#306) · 4abda6f5
  user4543 authored Feb 09, 2022
```
**Description**
Update rccl-tests submodule to fix divide by zero error.
```
  4abda6f5
- Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302) · 6cdf7595
  Ziyue Yang authored Feb 09, 2022
```
**Description**
This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect performance, and revises benchmark metrics accordingly.
```
  6cdf7595
08 Feb, 2022 2 commits
- Benchmarks: Add Feature - Add GDR-only nccl-tests for Nvidia machines (#299) · 433785fd
  Ziyue Yang authored Feb 08, 2022
```
This commit adds GDR-only nccl-tests for Nvidia machines. It also bumps NCCL to v2.10.3-1 to achieve peak performance in this test.
```
  433785fd
- Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301) · 682b2c12
  Ziyue Yang authored Feb 08, 2022
```
This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
```
  682b2c12
07 Feb, 2022 1 commit

Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298) · 85389055

Ziyue Yang authored Feb 07, 2022

**Description**
This commit does the following to reduce result variance in the gpu_copy benchmark:
1) Add a warmup phase to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional, since enabling it is not recommended in performance tests;
4) Enlarge the message size in the performance benchmark.

85389055

30 Jan, 2022 1 commit
- Bug - Fix typo in document (#297) · 28195be6
  Yuting Jiang authored Jan 30, 2022
```
Fix typo in document.
```
  28195be6
29 Jan, 2022 3 commits
- Benchmarks - Support T4 and A10 in GEMM benchmark (#294) · 3419447c
  Yifan Xiong authored Jan 29, 2022
```
Support T4 and A10 in GEMM benchmark.
```
  3419447c
- Config - Support customized env for all modes (#295) · 3524975c
  Yifan Xiong authored Jan 29, 2022
```
Support customized env for all modes in configuration.
```
  3524975c
- Benchmarks: Fix Bug - Fix GPU scan logic in gpu_copy (#296) · f3d05006
  Ziyue Yang authored Jan 29, 2022
```
Fix bug of GPU scan logic in bidirectional tests.
```
  f3d05006
28 Jan, 2022 2 commits

Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287) · d03d110f

guoshzhao authored Jan 28, 2022

**Description**
Sync the E2E training results among all workers for each step.

**Major Revision**
- Sync (allreduce max) the E2E training results among all workers.
- Avoid using ':0' in the metric name if only one rank has output.

d03d110f
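
The "allreduce max" sync above means every worker ends up with the per-step maximum across all workers, so the reported step time reflects the slowest rank. A minimal single-process simulation, assuming made-up step times; in a real job this would be a collective operation across processes rather than the plain `allreduce_max` helper used here:

```python
def allreduce_max(per_rank_values):
    """Simulate an allreduce with a max reduction over equal-length lists.

    per_rank_values[r][s] is the value rank r measured at step s.
    Every rank would receive the returned list.
    """
    return [max(step_values) for step_values in zip(*per_rank_values)]


# Hypothetical step times in milliseconds for two ranks and three steps.
step_times = [
    [10.0, 12.0, 11.0],  # rank 0
    [11.5, 11.0, 11.0],  # rank 1
]

synced = allreduce_max(step_times)
print(synced)  # [11.5, 12.0, 11.0]
```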

Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288) · d877ca23

guoshzhao authored Jan 28, 2022

**Description**
Add timeout feature for each benchmark.

**Major Revision**
- Add `timeout` config for each benchmark. The current config files set a timeout only for kernel-launch as an example; other benchmarks can be configured in the future.
- Set the timeout config for `ansible_runner.run()`. Runner will get return code 254:
   [ansible.py:80][WARNING] Run failed, return code 254.
- Use the `timeout` command to terminate the client process.

d877ca23
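
The general idea of bounding a benchmark process with a timeout can be sketched as follows. Note that this swaps in Python's `subprocess` timeout handling rather than the `timeout` command or `ansible_runner` used by the commit, and `run_with_timeout` is a hypothetical helper:

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it exceeded timeout_s.

    subprocess.run kills the child process before raising TimeoutExpired.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode


# A stand-in "benchmark" that sleeps longer than the allowed time.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(slow_cmd, timeout_s=0.5))  # None, the run timed out
```

Returning `None` for a timed-out run keeps it distinguishable from a benchmark that finished with a non-zero exit code.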

27 Jan, 2022 1 commit
- Config - Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml (#292) · f283b536
  Yuting Jiang authored Jan 28, 2022
```
**Description**
Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml
```
  f283b536
25 Jan, 2022 1 commit

Config - Update benchmark naming to support annotations (#284) · 7d7cd3dc

Yifan Xiong authored Jan 25, 2022

__Description__

Update benchmark naming to support annotations.

__Major Revisions__
- Update name for `create_benchmark_context` in executor.
- Backward compatibility for model benchmarks using "_models" suffix.
- Update documents.

7d7cd3dc

24 Jan, 2022 2 commits

Bug: Fix code insecure issue that binds a socket to all network interfaces (#291) · 35fc06eb
Yuting Jiang authored Jan 24, 2022
```
**Description**
Fix the insecure code that binds a socket to all network interfaces.
```
35fc06eb
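
Binding to all interfaces (for example `0.0.0.0`) exposes a listener on every network the host is attached to, while binding to a specific address limits who can reach it. A minimal sketch of the safer pattern, assuming a hypothetical `open_listener` helper and the loopback address as the intended interface; the address actually used by the fix is not stated in the commit:

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Open a TCP listener bound to one explicit interface.

    port=0 asks the OS to pick any free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    sock.listen()
    return sock


listener = open_listener()
bound_host, bound_port = listener.getsockname()
listener.close()
print(bound_host, bound_port)
```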

Bug: Fix code insecure issue of integer overflow in cublas function (#290) · 380ce400

Yuting Jiang authored Jan 24, 2022

**Description**
Fix the insecure issue of a multiplication result being converted to a larger type.

**Major Revision**
- Use a cast to ensure that the multiplication is done in `long long` to avoid overflow.

380ce400
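
The overflow arises when two 32-bit operands are multiplied before the result is widened. Python integers do not overflow, so the sketch below emulates fixed-width results with `ctypes`; the matrix dimensions are made up for illustration, and the wraparound shown is the typical outcome (signed overflow is formally undefined behavior in C and C++):

```python
import ctypes


def mul_as_int32(a, b):
    """Emulate a 32-bit multiply: the product is truncated to 32 bits."""
    return ctypes.c_int32(a * b).value


def mul_as_int64(a, b):
    """Emulate casting an operand to long long first: the product fits in 64 bits."""
    return ctypes.c_int64(a * b).value


m, n = 50000, 50000  # hypothetical matrix dimensions

print(mul_as_int32(m, n))  # -1794967296, wrapped around
print(mul_as_int64(m, n))  # 2500000000, the correct element count
```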

23 Jan, 2022 1 commit

Bump nanoid from 3.1.23 to 3.2.0 in /website (#286) · 5f6ad0cd

dependabot[bot] authored Jan 23, 2022

Bumps [nanoid](https://github.com/ai/nanoid) from 3.1.23 to 3.2.0.
- [Release notes](https://github.com/ai/nanoid/releases)
- [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md)
- [Commits](https://github.com/ai/nanoid/compare/3.1.23...3.2.0)

---
updated-dependencies:
- dependency-name: nanoid
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

5f6ad0cd

21 Jan, 2022 1 commit

Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285) · 74421ffe

Ziyue Yang authored Jan 21, 2022

**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.

74421ffe