- 28 Apr, 2022 1 commit
-
-
Yifan Xiong authored
__Description__ Upgrade version and release note. __Major Revision__ - Upgrade package versions - Add release note for v0.5.0
-
- 25 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Fix a bug in the duration feature for model benchmarks in distributed mode. **Major Revision** - Add all_reduce to sync the result of is_finished (the function that decides whether the model benchmark should stop) in each step, to avoid ranks disagreeing on when the duration ends (a rank may otherwise enter one more step and never finish). - Add torch.cuda.synchronize() before and after the step time measurement in train_step() for all model benchmarks, because some operations in train_step() may be asynchronous, resulting in incorrect step time records (for example, lstm).
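The effect of syncing the stop decision can be illustrated with a small single-process simulation. This is only a sketch, not the SuperBench code: `all_reduce_max` is a hypothetical stand-in for a real collective, and using MAX (logical OR) as the reduction is an assumption, since the commit message does not name the reduction op.

```python
def local_is_finished(elapsed_s, duration_s):
    """Per-rank check: has this rank already run for the requested duration?"""
    return elapsed_s >= duration_s


def all_reduce_max(flags):
    """Hypothetical stand-in for an all-reduce with MAX: all ranks get the same value."""
    return [max(flags)] * len(flags)


def run(step_times, duration_s, sync):
    """Return how many steps each simulated rank executes.

    step_times[r] is the constant step time (seconds) of rank r.
    """
    n = len(step_times)
    elapsed = [0.0] * n
    steps = [0] * n
    done = [False] * n
    while not all(done):
        # Every rank that has not stopped yet executes one more step.
        for r in range(n):
            if not done[r]:
                elapsed[r] += step_times[r]
                steps[r] += 1
        flags = [int(local_is_finished(elapsed[r], duration_s)) for r in range(n)]
        if sync:
            # Assumption: MAX reduction, so all ranks stop once any rank is finished.
            flags = all_reduce_max(flags)
        done = [bool(f) for f in flags]
    return steps


unsynced = run([1.0, 1.5], duration_s=3.0, sync=False)
synced = run([1.0, 1.5], duration_s=3.0, sync=True)
print(unsynced, synced)
```

Without the sync, the two simulated ranks end up with different step counts, which in a real distributed run would leave one rank waiting in a collective that its peer never joins. With the sync, both ranks stop after the same step.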
-
- 21 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Fix bugs in syncing results on the root rank for e2e model benchmarks. Bugs: - results were not changed to sync results (grammar) - sync results not applied to all ranks but only root rank - output result on local_rank 0 not global root rank
-
- 19 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Support regex in annotations of benchmark naming for metrics in rules. For example: metrics: \- 'model-benchmarks:resnet50:float/.\*/fp16_train_throughput' -> \- 'model-benchmarks:.\*/.\*/fp16_train_throughput'
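As a rough illustration of what regex support in metric names allows, the sketch below selects metrics with Python's `re` module. The helper `select_metrics`, the full-match semantics, and the sample metric names are assumptions made for this example; the commit message does not describe how SuperBench performs the match.

```python
import re


def select_metrics(patterns, available):
    """Return the names in `available` that fully match any pattern, in input order."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [name for name in available if any(c.fullmatch(name) for c in compiled)]


# Made-up metric names, for illustration only.
available = [
    'model-benchmarks:resnet/pytorch-resnet50/fp16_train_throughput',
    'model-benchmarks:lstm/pytorch-lstm/fp16_train_throughput',
    'model-benchmarks:lstm/pytorch-lstm/fp32_train_throughput',
]

matched = select_metrics(['model-benchmarks:.*/.*/fp16_train_throughput'], available)
print(matched)
```

A single pattern with a wildcard in the annotation position covers every model benchmark that reports `fp16_train_throughput`, so the rule no longer needs one entry per annotation.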
-
- 18 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Support rules with no matching metrics and unify the output file name in result_summary. **Major Revision** - Support rules with no matched metrics in result summary - Unify output file name to 'results-summary'
-
- 16 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Pin the ort version to '1.10.0'.
-
- 11 Apr, 2022 2 commits
-
-
guoshzhao authored
**Description** Integrate FAMBench into superbench based on docker implementation: https://github.com/facebookresearch/FAMBench The script to run all benchmarks is: https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh
-
user4543 authored
**Description** Integrate output of all nodes' diagnosis results.
-
- 10 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Output results of all nodes in data diagnosis.
-
- 08 Apr, 2022 2 commits
- 01 Apr, 2022 1 commit
-
-
guoshzhao authored
**Description** Use the config `log_raw_data` to control whether the raw data is logged to a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw data to a file.
-
- 31 Mar, 2022 1 commit
-
-
dependabot[bot] authored
Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6. - [Release notes](https://github.com/substack/minimist/releases) - [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6) --- updated-dependencies: - dependency-name: minimist dependency-type: indirect ...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 24 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Add result summary in excel, md, and html formats. **Major Revision** - Add ResultSummary class to support result summary in excel, md, and html formats. - Abstract RuleBase class for functions commonly used by DataDiagnosis and ResultSummary.
-
- 22 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Remove the fp16 sample type-conversion time for training cnn and lstm inference.
-
- 21 Mar, 2022 1 commit
-
-
Yifan Xiong authored
Add inference config for preview SKUs, including: * [NC96ads_A100_v4](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series) * [NV18ads_A10_v5](https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series)
-
- 17 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Remove the fp16 sample type-conversion time for cnn and lstm models.
-
- 16 Mar, 2022 1 commit
-
-
rafsalas19 authored
**Description** Modifications adding GPU-Burn to SuperBench. - added third party submodule - modified Makefile to make gpu-burn binary - added/modified microbenchmarks to add gpu-burn python scripts - modified default and azure_ndv4 configs to add gpu-burn
-
- 15 Mar, 2022 2 commits
-
-
user4543 authored
**Description** Fix the bug in writing results to files in mpi mode.
-
user4543 authored
**Description** Add md and html output format for DataDiagnosis. **Major Revision** - add md and html support in file_handler - add interface in DataDiagnosis for md and HTML output **Minor Revision** - move excel and json output interface into DataDiagnosis
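To give a sense of what an md output path involves, here is a minimal sketch that renders rows as a Markdown table. The function name `to_markdown_table` and the row layout are hypothetical and are not taken from `file_handler` or DataDiagnosis.

```python
def to_markdown_table(rows, columns):
    """Render a list of dict rows as a Markdown table with the given column order."""
    header = '| ' + ' | '.join(columns) + ' |'
    separator = '| ' + ' | '.join('---' for _ in columns) + ' |'
    body = [
        '| ' + ' | '.join(str(row.get(col, '')) for col in columns) + ' |'
        for row in rows
    ]
    return '\n'.join([header, separator] + body)


# Made-up diagnosis rows, for illustration only.
table = to_markdown_table(
    [{'node': 'node0', 'issues': 2}, {'node': 'node1', 'issues': 0}],
    ['node', 'issues'],
)
print(table)
```

The sketch does not escape `|` characters inside cell values, which a production writer would need to handle.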
-
- 09 Mar, 2022 1 commit
-
-
Yifan Xiong authored
Fix env file path to absolute path in `docker exec`, in case there're mixed ssh and local connections or different users are used.
-
- 07 Mar, 2022 2 commits
-
-
user4543 authored
**Description** Abstract RuleBase from DataDiagnosis.
-
dependabot[bot] authored
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.8 to 1.5.10. - [Release notes](https://github.com/unshiftio/url-parse/releases) - [Commits](https://github.com/unshiftio/url-parse/compare/1.5.8...1.5.10) --- updated-dependencies: - dependency-name: url-parse dependency-type: indirect ...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 06 Mar, 2022 1 commit
-
-
Jeff Daily authored
**Description** The BatchNorm operator is not numerically stable in fp16. PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models. Refer to https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 for superbench more accurately reflects real workloads.
-
- 28 Feb, 2022 2 commits
-
-
user4543 authored
**Description** Add dockerfile for rocm5.0.1.
-
dependabot[bot] authored
Bumps [prismjs](https://github.com/PrismJS/prism) from 1.23.0 to 1.27.0. - [Release notes](https://github.com/PrismJS/prism/releases) - [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md) - [Commits](https://github.com/PrismJS/prism/compare/v1.23.0...v1.27.0) --- updated-dependencies: - dependency-name: prismjs dependency-type: indirect ...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 25 Feb, 2022 1 commit
-
-
user4543 authored
**Description** Add rocm5.0 dockerfile.
-
- 24 Feb, 2022 2 commits
-
-
Ziyue Yang authored
**Description** Fix invalid reference of P2P detection result in gpu_copy.
-
user4543 authored
**Description** Make gpcnet only for cuda.
-
- 22 Feb, 2022 1 commit
-
-
user4543 authored
**Description** Fix the issue where HIP_ARCHITECTURES is empty with cmake>=3.21.0. Refer to https://github.com/ROCm-Developer-Tools/HIP/pull/2364
-
- 21 Feb, 2022 1 commit
-
-
dependabot[bot] authored
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.1 to 1.5.8. - [Release notes](https://github.com/unshiftio/url-parse/releases) - [Commits](https://github.com/unshiftio/url-parse/compare/1.5.1...1.5.8) --- updated-dependencies: - dependency-name: url-parse dependency-type: indirect ...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 20 Feb, 2022 2 commits
-
-
Yifan Xiong authored
Add T4 configurations for inference.
-
user4543 authored
**Description** Add a multi-rules feature to data diagnosis to support checks that combine multiple rules. **Major Revision** - revise the rule design to support checks that combine multiple rules - update related code and tests
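One way to picture a combined check is to count violations per rule first and then evaluate a criterion over those counts. The sketch below follows that idea; the rule layout, the helper names `check_single_rules` and `check_multi_rule`, and the sample metrics are all hypothetical and do not reproduce the DataDiagnosis rule schema.

```python
def check_single_rules(metrics, rules):
    """Return {rule_name: number of metrics violating that rule} for one node."""
    violations = {}
    for name, rule in rules.items():
        violations[name] = sum(
            1
            for metric in rule['metrics']
            if metric in metrics and rule['is_violated'](metrics[metric])
        )
    return violations


def check_multi_rule(violations, criteria):
    """Evaluate a combined criterion over the per-rule violation counts."""
    return bool(criteria(violations))


# Hypothetical rules: bandwidth must be at least 100, latency at most 5.
rules = {
    'rule_bw': {'metrics': ['bw_a', 'bw_b'], 'is_violated': lambda v: v < 100.0},
    'rule_lat': {'metrics': ['lat'], 'is_violated': lambda v: v > 5.0},
}
node_metrics = {'bw_a': 90.0, 'bw_b': 120.0, 'lat': 7.5}

violations = check_single_rules(node_metrics, rules)
# Flag the node only when both rules together report at least two violations.
defective = check_multi_rule(
    violations, lambda label: label['rule_bw'] + label['rule_lat'] >= 2
)
print(violations, defective)
```

Separating the per-rule counts from the combined criterion keeps each single rule simple while letting the combined rule express conditions such as "fails on bandwidth and latency at once".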
-
- 15 Feb, 2022 2 commits
-
-
Yifan Xiong authored
Fix env file path for `docker run`.
-
dependabot[bot] authored
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.7 to 1.14.8. - [Release notes](https://github.com/follow-redirects/follow-redirects/releases) - [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.7...v1.14.8) --- updated-dependencies: - dependency-name: follow-redirects dependency-type: indirect ...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 10 Feb, 2022 1 commit
-
-
user4543 authored
**Description** Add support for init_process_group with pytorch>=1.9.0. **Major Revision** - Use PrefixStore(TCPStore) to call init_process_group manually for each model run
-
- 09 Feb, 2022 2 commits
-
-
user4543 authored
**Description** Update rccl-tests submodule to fix divide by zero error.
-
Ziyue Yang authored
**Description** This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect their performance, and revises the benchmark metrics accordingly.
-
- 08 Feb, 2022 2 commits
-
-
Ziyue Yang authored
This commit adds GDR-only nccl-tests for Nvidia machines. Also bump NCCL to v2.10.3-1 to achieve peak performance in this test.
-
Ziyue Yang authored
This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
-