- 09 Mar, 2022 1 commit
-
-
Yifan Xiong authored
Use an absolute env file path in `docker exec`, so that it resolves correctly when ssh and local connections are mixed or when different users are used.
-
- 07 Mar, 2022 2 commits
-
-
user4543 authored
**Description** Abstract RuleBase from DataDiagnosis.
-
dependabot[bot] authored
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.8 to 1.5.10.
[Release notes](https://github.com/unshiftio/url-parse/releases) | [Commits](https://github.com/unshiftio/url-parse/compare/1.5.8...1.5.10)
Updated dependencies: url-parse (indirect).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 06 Mar, 2022 1 commit
-
-
Jeff Daily authored
**Description** The BatchNorm operator is not numerically stable in fp16. The PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models; see https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 makes superbench reflect real workloads more accurately.
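As a rough illustration of why a reduction such as a batch mean can go wrong in half precision, the sketch below accumulates the same values with fp16 and fp32 rounding. It is only an analogy built on the standard library's `struct` half-float format: it is not the SuperBench change and not PyTorch's BatchNorm kernel, and `batch_mean` and the input values are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def batch_mean(values, round_fn):
    """Mean of `values` where every partial sum is rounded by `round_fn`."""
    total = 0.0
    for v in values:
        total = round_fn(total + round_fn(v))
    return round_fn(total / len(values))


# Hypothetical activations: 8192 values that are all exactly 1.0.
batch = [1.0] * 8192
mean_fp16 = batch_mean(batch, to_fp16)
mean_fp32 = batch_mean(batch, to_fp32)
print(f"fp16 mean: {mean_fp16}, fp32 mean: {mean_fp32}")
```

Neighbouring fp16 values are 2 apart once the running sum reaches 2048, so adding 1.0 no longer changes the sum and the fp16 mean comes out too small, while the fp32 accumulation stays exact for this input.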
-
- 28 Feb, 2022 2 commits
-
-
user4543 authored
**Description** Add dockerfile for rocm5.0.1.
-
dependabot[bot] authored
Bumps [prismjs](https://github.com/PrismJS/prism) from 1.23.0 to 1.27.0.
[Release notes](https://github.com/PrismJS/prism/releases) | [Changelog](https://github.com/PrismJS/prism/blob/master/CHANGELOG.md) | [Commits](https://github.com/PrismJS/prism/compare/v1.23.0...v1.27.0)
Updated dependencies: prismjs (indirect).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 25 Feb, 2022 1 commit
-
-
user4543 authored
**Description** Add rocm5.0 dockerfile.
-
- 24 Feb, 2022 2 commits
-
-
Ziyue Yang authored
**Description** Fix an invalid reference to the P2P detection result in gpu_copy.
-
user4543 authored
**Description** Make gpcnet only for cuda.
-
- 22 Feb, 2022 1 commit
-
-
user4543 authored
**Description** Fix the issue where `HIP_ARCHITECTURES` is empty with cmake>=3.21.0. Refer to https://github.com/ROCm-Developer-Tools/HIP/pull/2364
-
- 21 Feb, 2022 1 commit
-
-
dependabot[bot] authored
Bumps [url-parse](https://github.com/unshiftio/url-parse) from 1.5.1 to 1.5.8.
[Release notes](https://github.com/unshiftio/url-parse/releases) | [Commits](https://github.com/unshiftio/url-parse/compare/1.5.1...1.5.8)
Updated dependencies: url-parse (indirect).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 20 Feb, 2022 2 commits
-
-
Yifan Xiong authored
Add T4 configurations for inference.
-
user4543 authored
**Description** Add a multi-rules feature to data diagnosis to support checks that combine multiple rules. **Major Revision** - Revise the rule design to support checks that combine multiple rules. - Update related code and tests.
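A combined check over several rules could look roughly like the sketch below. Everything in it is hypothetical (rule names, thresholds, metric names, and the `diagnose` helper); it does not reproduce SuperBench's actual rule format, only the general idea of labelling a node per rule and then applying a combining criterion to those labels.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]
Labels = Dict[str, int]


def rule_low_bandwidth(metrics: Metrics) -> int:
    """Hypothetical rule: count bandwidth metrics below 100."""
    return sum(1 for name, v in metrics.items() if name.endswith("_bw") and v < 100.0)


def rule_high_latency(metrics: Metrics) -> int:
    """Hypothetical rule: count latency metrics above 10."""
    return sum(1 for name, v in metrics.items() if name.endswith("_lat") and v > 10.0)


SINGLE_RULES: Dict[str, Callable[[Metrics], int]] = {
    "low_bw": rule_low_bandwidth,
    "high_lat": rule_high_latency,
}


def diagnose(metrics: Metrics, combine: Callable[[Labels], bool]) -> bool:
    """Label the node with every single rule, then apply the combined criterion."""
    labels = {name: rule(metrics) for name, rule in SINGLE_RULES.items()}
    return combine(labels)


def both_violated(labels: Labels) -> bool:
    """Combined criterion: flag a node only if both single rules are violated."""
    return labels["low_bw"] >= 1 and labels["high_lat"] >= 1


healthy = {"nccl_bw": 180.0, "p2p_lat": 3.0}
slow_only = {"nccl_bw": 60.0, "p2p_lat": 3.0}
defective = {"nccl_bw": 60.0, "p2p_lat": 25.0}
for node in (healthy, slow_only, defective):
    print(diagnose(node, both_violated))
```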
-
- 15 Feb, 2022 2 commits
-
-
Yifan Xiong authored
Fix env file path for `docker run`.
-
dependabot[bot] authored
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.7 to 1.14.8.
[Release notes](https://github.com/follow-redirects/follow-redirects/releases) | [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.7...v1.14.8)
Updated dependencies: follow-redirects (indirect).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 10 Feb, 2022 1 commit
-
-
user4543 authored
**Description** Support `init_process_group` with pytorch>=1.9.0. **Major Revision** - Use PrefixStore(TCPStore) to call init_process_group manually for each model run.
-
- 09 Feb, 2022 2 commits
-
-
user4543 authored
**Description** Update rccl-tests submodule to fix divide by zero error.
-
Ziyue Yang authored
**Description** This commit removes NUMA binding for device-to-device tests, because NUMA doesn't affect their performance, and revises benchmark metrics accordingly.
-
- 08 Feb, 2022 2 commits
-
-
Ziyue Yang authored
This commit adds GDR-only nccl-tests for NVIDIA machines. It also bumps NCCL to v2.10.3-1 to achieve peak performance in this test.
-
Ziyue Yang authored
This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
-
- 07 Feb, 2022 1 commit
-
-
Ziyue Yang authored
**Description** This commit does the following to reduce result variance in the gpu_copy benchmark: 1) Add a warmup phase to avoid timing instability caused by first-time CUDA kernel launch overhead; 2) Use CUDA events for timing instead of CPU timestamps; 3) Make data checking optional, since enabling it is not recommended in performance tests; 4) Enlarge the message size in the performance benchmark.
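The warmup idea can be sketched on the CPU side as follows. This is only an analogy, assuming a generic callable under test: the commit itself times GPU copies with CUDA events, which this sketch does not use, and `benchmark` and its parameters are hypothetical names.

```python
import time


def benchmark(fn, warmup_iters=5, timed_iters=20):
    """Run `fn` untimed `warmup_iters` times, then return per-call durations in seconds."""
    for _ in range(warmup_iters):
        fn()  # result discarded: absorbs one-time setup cost
    durations = []
    for _ in range(timed_iters):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


calls = []
durations = benchmark(lambda: calls.append(sum(range(10_000))), warmup_iters=3, timed_iters=10)
print(f"calls: {len(calls)}, timed samples: {len(durations)}")
```

Only the timed iterations contribute samples, so any first-call overhead falls into the discarded warmup runs.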
-
- 30 Jan, 2022 1 commit
-
-
Yuting Jiang authored
Fix typo in document.
-
- 29 Jan, 2022 3 commits
-
-
Yifan Xiong authored
Support T4 and A10 in GEMM benchmark.
-
Yifan Xiong authored
Support customized env for all modes in configuration.
-
Ziyue Yang authored
Fix a bug in the GPU scan logic of bidirectional tests.
-
- 28 Jan, 2022 2 commits
-
-
guoshzhao authored
**Major Revision** - Sync the E2E training results among all workers with an allreduce (max). - Avoid using ':0' in the metric name if only one rank has output.
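The two revisions can be mimicked without a distributed backend. The sketch below simulates an allreduce with the max operation over in-memory lists and shows one possible naming rule; `allreduce_max`, `metric_name`, the `step_time` metric, and the sample numbers are all hypothetical and not taken from the SuperBench code.

```python
def allreduce_max(per_rank_results):
    """Simulate an allreduce with MAX: element-wise max, shared by every rank."""
    reduced = [max(values) for values in zip(*per_rank_results)]
    return [list(reduced) for _ in per_rank_results]


def metric_name(base, rank, num_ranks_with_output):
    """Append ':<rank>' only when more than one rank reports output."""
    if num_ranks_with_output == 1:
        return base
    return f"{base}:{rank}"


# Hypothetical per-step times (ms) measured on three workers.
step_times = [
    [10.0, 12.5, 11.0],
    [10.4, 12.0, 11.8],
    [9.9, 13.1, 11.2],
]
synced = allreduce_max(step_times)
print(synced[0])
print(metric_name("step_time", 0, 1), metric_name("step_time", 2, 3))
```

Taking the maximum means every worker reports the time of the slowest worker for each step, which is the time that bounds a synchronized training step.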
-
guoshzhao authored
**Description** Add a timeout feature for each benchmark. **Major Revision** - Add a `timeout` config for each benchmark. The current config files set the timeout only for kernel-launch, as an example; other benchmarks can be configured in the future. - Set the timeout config for `ansible_runner.run()`. The runner will then get return code 254: `[ansible.py:80][WARNING] Run failed, return code 254.` - Use the `timeout` command to terminate the client process.
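A per-process timeout can be sketched in plain Python as below. Note the swap: the commit uses the `timeout` command and `ansible_runner.run()`, whereas this sketch uses the standard library's `subprocess.run(..., timeout=...)`. The helper name and the choice to return 124 (the exit status GNU `timeout` uses when the time limit is hit) are assumptions for illustration.

```python
import subprocess
import sys

# Assumption: mirror the exit status GNU `timeout` reports when the limit is reached.
TIMEOUT_RETURN_CODE = 124


def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; return its exit code, or TIMEOUT_RETURN_CODE if it ran too long."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        return TIMEOUT_RETURN_CODE


quick = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5)
print(quick, slow)
```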
-
- 27 Jan, 2022 1 commit
-
-
Yuting Jiang authored
**Description** Disable disk-benchmark in ndmv4.yaml and change batch size to 1 in default.yaml
-
- 25 Jan, 2022 1 commit
-
-
Yifan Xiong authored
__Description__ Update benchmark naming to support annotations. __Major Revisions__ - Update name for `create_benchmark_context` in executor. - Backward compatibility for model benchmarks using "_models" suffix. - Update documents.
-
- 24 Jan, 2022 2 commits
-
-
Yuting Jiang authored
**Description** Fix the security issue of binding a socket to all network interfaces.
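The general remedy for this class of issue is to bind to one explicit address instead of the wildcard address. The sketch below shows it with Python's `socket` module; the affected SuperBench code is not shown in this log, so the function name, the loopback default, and the ephemeral port are assumptions.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Listen on one explicit interface instead of the wildcard address 0.0.0.0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))  # port 0 lets the OS pick a free port
    sock.listen()
    return sock


listener = open_listener()
bound_host, bound_port = listener.getsockname()
print(f"listening on {bound_host}:{bound_port}")
listener.close()
```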
-
Yuting Jiang authored
**Description** Fix the security issue of a multiplication result being converted to a larger type. **Major Revision** - Use a cast so that the multiplication is done in `long long`, avoiding overflow.
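The effect of casting before multiplying can be emulated with `ctypes` fixed-width integers. This is an illustration only: the helper names and operands are hypothetical, and the 32-bit wraparound shown is what commonly happens in practice, although signed overflow is formally undefined behaviour in C and C++.

```python
import ctypes


def mul_as_int32(a: int, b: int) -> int:
    """Emulate `int * int`: the product is truncated to 32 bits before any widening."""
    return ctypes.c_int32(a * b).value


def mul_as_int64(a: int, b: int) -> int:
    """Emulate `(long long)a * b`: the multiplication itself is done in 64 bits."""
    return ctypes.c_int64(a * b).value


# Hypothetical operands: a buffer of 100000 x 100000 elements.
rows, cols = 100_000, 100_000
print(mul_as_int32(rows, cols), mul_as_int64(rows, cols))
```

Assigning the 32-bit product to a 64-bit variable afterwards does not help, because the truncation has already happened; the cast has to be applied to an operand.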
-
- 23 Jan, 2022 1 commit
-
-
dependabot[bot] authored
Bumps [nanoid](https://github.com/ai/nanoid) from 3.1.23 to 3.2.0.
[Release notes](https://github.com/ai/nanoid/releases) | [Changelog](https://github.com/ai/nanoid/blob/main/CHANGELOG.md) | [Commits](https://github.com/ai/nanoid/compare/3.1.23...3.2.0)
Updated dependencies: nanoid (indirect).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 21 Jan, 2022 1 commit
-
-
Ziyue Yang authored
**Description** This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.
-
- 19 Jan, 2022 1 commit
-
-
guoshzhao authored
**Description** Add 50th, 90th, 95th, 99th, and 99.9th percentile latency metrics for ORT and PyTorch inference benchmarks.
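One way to derive such metrics from raw latency samples is the nearest-rank method, sketched below. The log does not say which percentile definition the benchmarks use, so the method, the function names, and the sample data are assumptions.

```python
import math

PERCENTILES = (50, 90, 95, 99, 99.9)


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_percentiles(latencies_ms):
    """Map each percentile in PERCENTILES to the corresponding latency."""
    ordered = sorted(latencies_ms)
    return {p: percentile(ordered, p) for p in PERCENTILES}


# Hypothetical samples: 100 requests taking 1 ms, 2 ms, ..., 100 ms.
samples = [float(ms) for ms in range(100, 0, -1)]
result = latency_percentiles(samples)
print(result)
```

With only 100 samples the 99.9th percentile collapses onto the maximum, which is a reminder that tail percentiles need a large number of samples to be meaningful.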
-
- 18 Jan, 2022 1 commit
-
-
Yifan Xiong authored
__Description__ Add the commands `sb benchmark list` and `sb benchmark list-parameters` to support listing all optional parameters for benchmarks.
<details>
<summary>Examples</summary>
<pre>
$ sb benchmark list -n [a-z]+-bw -o table
Result
--------
mem-bw
nccl-bw
rccl-bw
</pre>
<pre>
$ sb benchmark list-parameters -n mem-bw
=== mem-bw ===
optional arguments:
  --bin_dir str             Specify the directory of the benchmark binary.
  --duration int            The elapsed time of benchmark in seconds.
  --mem_type str [str ...]  Memory types to benchmark. E.g. htod dtoh dtod.
  --memory str              Memory argument for bandwidthtest. E.g. pinned unpinned.
  --run_count int           The run count of benchmark.
  --shmoo_mode              Enable shmoo mode for bandwidthtest.
default values:
{'bin_dir': None, 'duration': 0, 'mem_type': ['htod', 'dtoh'], 'memory': 'pinned', 'run_count': 1}
</pre>
</details>
__Major Revisions__
* Add `sb benchmark list` to list benchmarks matching given name.
* Add `sb benchmark list-parameters` to list parameters for benchmarks which match given name.

__Minor Revisions__
* Sort format help text for argparse.
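Name filtering by regular expression, as in the `-n [a-z]+-bw` example, can be sketched as follows. The registry contents and the `list_benchmarks` helper are hypothetical, and requiring the pattern to match the whole name is an assumption; the real CLI may match differently.

```python
import re

# Hypothetical registry of benchmark names.
REGISTERED_BENCHMARKS = ["gemm-flops", "kernel-launch", "mem-bw", "nccl-bw", "rccl-bw"]


def list_benchmarks(pattern=None):
    """Return registered names, optionally keeping only those fully matching `pattern`."""
    if pattern is None:
        return sorted(REGISTERED_BENCHMARKS)
    regex = re.compile(pattern)
    return sorted(name for name in REGISTERED_BENCHMARKS if regex.fullmatch(name))


matches = list_benchmarks(r"[a-z]+-bw")
print("\n".join(matches))
```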
-
- 17 Jan, 2022 2 commits
-
-
dependabot[bot] authored
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.1 to 1.14.7.
[Release notes](https://github.com/follow-redirects/follow-redirects/releases) | [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.1...v1.14.7)
Updated dependencies: follow-redirects (indirect).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
dependabot[bot] authored
Bumps [shelljs](https://github.com/shelljs/shelljs) from 0.8.4 to 0.8.5.
[Release notes](https://github.com/shelljs/shelljs/releases) | [Changelog](https://github.com/shelljs/shelljs/blob/master/CHANGELOG.md) | [Commits](https://github.com/shelljs/shelljs/compare/v0.8.4...v0.8.5)
Updated dependencies: shelljs (indirect).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 30 Dec, 2021 1 commit
-
-
Yifan Xiong authored
__Description__ Cherry-pick bug fixes from v0.4.0 to main.
__Major Revisions__
* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
-
- 14 Dec, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add usage for data diagnosis.
-
- 13 Dec, 2021 1 commit
-
-
guoshzhao authored
**Description** Update docs for monitor.
-