- 20 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Fix a potential port conflict caused by a race condition between time-of-check and time-of-use, by keeping the port bound throughout. Modify the function to resolve flake8 C901 while keeping the logic the same.
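The idea of keeping the port bound can be illustrated with a minimal sketch. This is not the code from the commit; `reserve_port` and `port_is_free` are hypothetical helper names, and the loopback address is an assumption chosen for the example.

```python
import socket


def reserve_port(host="127.0.0.1"):
    """Bind to an OS-assigned free port and return the still-bound socket.

    Returning the open socket (instead of closing it and returning only the
    port number) avoids the window in which another process could take the port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to pick a free port
    return sock, sock.getsockname()[1]


def port_is_free(host, port):
    """Return True if a fresh socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


# While the reserving socket is open, nobody else can bind the same port.
held_sock, held_port = reserve_port()
conflict_possible = port_is_free("127.0.0.1", held_port)
held_sock.close()
```

A check-then-close-then-reuse pattern would release the port between the check and the use; holding the socket open removes that gap.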
-
- 13 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Add dependencies:
* include ndv4-topo.xml in cuda docker images
* require requests version to avoid RequestsDependencyWarning
-
- 09 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Fix several issues in ib validation benchmark:
* continue running when a timeout occurs in the middle, instead of aborting the whole mpi process
* make the timeout parameter configurable, with a default of 120 seconds
* avoid mixing stdio and iostream when printing to stdout
* set the default message size to 8M, which will saturate ib in most cases
* fix the hostfile path issue so that it can be found automatically in different cases
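The "continue on timeout" behavior can be sketched in Python as follows. The benchmark itself is implemented differently (it runs under mpi); `run_all` is a hypothetical helper, and only the 120 second default is taken from the commit message.

```python
import subprocess
import sys


def run_all(commands, timeout=120):
    """Run each command in order; record a timeout and move on instead of aborting."""
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            results.append((cmd, proc.returncode, proc.stdout))
        except subprocess.TimeoutExpired:
            # The child is killed by subprocess.run; mark it and keep going.
            results.append((cmd, None, ""))
    return results


commands = [
    [sys.executable, "-c", "import time; time.sleep(30)"],  # exceeds the timeout
    [sys.executable, "-c", "print('ok')"],                   # still gets to run
]
results = run_all(commands, timeout=2)
```

A return code of `None` marks the timed-out command, so later commands are still executed and reported.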
-
- 08 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in both 1 node and all nodes in one config by changing `node_num`. Update docs and add test case accordingly.
-
- 06 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Update dependencies and Dockerfile:
* upgrade nccl-tests and rccl-tests to the latest versions to match NCCL/RCCL versions
* unify image tag names on DockerHub
* remove verbose output in Dockerfile and fix some flags
-
- 05 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Support SKU auto-detection and use the corresponding benchmark config when running on an Azure VM.
-
- 29 Jun, 2022 2 commits
-
-
Yifan Xiong authored
Fix several issues in ib loopback benchmark:
* use `--report_gbits` and divide by 8 to get GB/s; previous results were MiB/s / 1000
* use the ib_write_bw binary built in third_party instead of the one in the system path
* update the metric names so that different hca indices share the same metric
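The unit fix amounts to simple arithmetic, sketched below. The function names and the 200 Gb/s figure are illustrative assumptions, not values from the benchmark.

```python
def gbits_to_gbytes(gbits_per_sec):
    """Convert Gb/s (as reported with --report_gbits) to GB/s."""
    return gbits_per_sec / 8


def legacy_value(gbytes_per_sec):
    """Reproduce the old calculation: the rate in MiB/s divided by 1000."""
    mib_per_sec = gbytes_per_sec * 1e9 / 2**20
    return mib_per_sec / 1000


correct = gbits_to_gbytes(200)   # 25.0 GB/s for a hypothetical 200 Gb/s link
previous = legacy_value(correct)  # about 23.84, roughly 4.6% too low
```

Dividing MiB/s by 1000 mixes binary and decimal prefixes, which is why the earlier numbers were slightly lower than the true GB/s.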
-
Yifan Xiong authored
Refine error message when GPU is not detected. Possible solutions if hardware exists and drivers are already installed:
* nvidia gpus:
  ```sh
  /sbin/modprobe nvidia-uvm
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
  ```
* amd gpus:
  ```sh
  modprobe amdgpu
  ```
-
- 24 Jun, 2022 2 commits
-
-
Yifan Xiong authored
Fix incorrect ulimit nofile config in Dockerfile. The Dockerfile uses sh rather than bash by default, and its `echo` does not accept any parameters, so `-e` was written into /etc/security/limits.conf.
-
Yifan Xiong authored
**Description**
Support multiple IB/GPU devices running simultaneously in ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node can be used to launch multiple perftest commands simultaneously. For each node pair in the config, the configured number of processes per node will run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.
-
- 19 Jun, 2022 2 commits
-
-
Yifan Xiong authored
Fix sudo issue when running without Docker, since the user account could be arbitrary in that case.
-
Yifan Xiong authored
**Description**
Update ROCm Dockerfile.

**Major Revisions**
- Add dockerfile for ROCm 5.1.3
- Merge 5.1.x and 5.0.x dockerfile
- Remove 4.2 and 4.0 legacy
- Update build pipeline accordingly
-
- 15 Jun, 2022 1 commit
-
-
Yifan Xiong authored
**Description**
Fix cmake and build issues.

**Major Revision**
* Remove unnecessary boost build
* Remove user-agent for mlc
* Remove -j for third party to build each project in sequence
* Fix ansible collections installation path
-
- 14 Jun, 2022 1 commit
-
-
Yifan Xiong authored
**Description**
Support `sb run` on host directly without Docker.

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` is specified.
- Update docs and tests correspondingly.
-
- 06 Jun, 2022 1 commit
-
-
dependabot[bot] authored
Bumps [eventsource](https://github.com/EventSource/eventsource) from 1.1.0 to 1.1.1.
- [Release notes](https://github.com/EventSource/eventsource/releases)
- [Changelog](https://github.com/EventSource/eventsource/blob/master/HISTORY.md)
- [Commits](https://github.com/EventSource/eventsource/compare/v1.1.0...v1.1.1)

---
updated-dependencies:
- dependency-name: eventsource
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 02 Jun, 2022 2 commits
-
-
dependabot[bot] authored
Bumps [cross-fetch](https://github.com/lquixada/cross-fetch) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/lquixada/cross-fetch/releases)
- [Commits](https://github.com/lquixada/cross-fetch/compare/v3.1.4...v3.1.5)

---
updated-dependencies:
- dependency-name: cross-fetch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
dependabot[bot] authored
Bumps [async](https://github.com/caolan/async) from 2.6.3 to 2.6.4.
- [Release notes](https://github.com/caolan/async/releases)
- [Changelog](https://github.com/caolan/async/blob/v2.6.4/CHANGELOG.md)
- [Commits](https://github.com/caolan/async/compare/v2.6.3...v2.6.4)

---
updated-dependencies:
- dependency-name: async
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 01 Jun, 2022 1 commit
-
-
user4543 authored
**Description**
Fix bugs in data diagnosis.

**Major Revision**
- add support to get baseline of the metric which uses custom benchmark naming with ':' like 'nccl-bw:default/allreduce_8_bw:0'
- save raw data of all metrics rather than metrics defined in diagnosis_rules.yaml when output_all is True
- fix bug of using wrong column index when applying format (red color and percentile) in the excel
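To make the naming case concrete, the sketch below splits a metric name of the form seen in the example above. The layout `benchmark[:annotation]/metric[:rank]` is an assumption inferred from that single example, and `split_metric_name` is a hypothetical helper, not the function used in data diagnosis.

```python
import re

# Assumed layout: benchmark[:annotation]/metric[:rank]
METRIC_NAME = re.compile(
    r"^(?P<benchmark>[^:/]+)"      # e.g. nccl-bw
    r"(?::(?P<annotation>[^/]+))?"  # optional custom annotation, e.g. default
    r"/(?P<metric>.+?)"             # e.g. allreduce_8_bw
    r"(?::(?P<rank>\d+))?$"         # optional trailing rank index, e.g. 0
)


def split_metric_name(name):
    """Return the parts of a metric name as a dict, or None if it does not match."""
    match = METRIC_NAME.match(name)
    return match.groupdict() if match else None


parts = split_metric_name("nccl-bw:default/allreduce_8_bw:0")
baseline_key = f"{parts['benchmark']}/{parts['metric']}"
```

With the annotation and rank stripped, `baseline_key` is the kind of name a baseline lookup could use regardless of the custom annotation.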
-
- 31 May, 2022 1 commit
-
-
user4543 authored
**Description** Add support to run sb command inside docker image - install missing dependency.
-
- 27 May, 2022 1 commit
-
-
user4543 authored
**Description** Update rccl version and fix issue in rocm5.1.1 dockerfile.
-
- 25 May, 2022 1 commit
-
-
user4543 authored
**Description** Add dockerfile for rocm5.1.1.
-
- 29 Apr, 2022 1 commit
-
-
Yifan Xiong authored
**Description**
Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**
* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
-
- 20 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Update links that reference other docs to use relative file paths with extensions.
-
- 15 Apr, 2022 1 commit
-
-
Jared Bowden authored
**Description** Fixes relative link in documentation: point to `../cli.md`.
-
- 11 Apr, 2022 2 commits
-
-
guoshzhao authored
**Description**
Integrate FAMBench into superbench based on docker implementation: https://github.com/facebookresearch/FAMBench

The script to run all benchmarks is: https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh
-
user4543 authored
**Description** Integrate output of diagnosis results for all nodes.
-
- 10 Apr, 2022 1 commit
-
-
user4543 authored
**Description** Output results of all nodes in data diagnosis.
-
- 08 Apr, 2022 2 commits
- 01 Apr, 2022 1 commit
-
-
guoshzhao authored
**Description** Use config `log_raw_data` to control whether raw data is logged to a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw data to a file.
-
- 31 Mar, 2022 1 commit
-
-
dependabot[bot] authored
Bumps [minimist](https://github.com/substack/minimist) from 1.2.5 to 1.2.6.
- [Release notes](https://github.com/substack/minimist/releases)
- [Commits](https://github.com/substack/minimist/compare/1.2.5...1.2.6)

---
updated-dependencies:
- dependency-name: minimist
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 24 Mar, 2022 1 commit
-
-
user4543 authored
**Description**
Add result summary in excel, md, and html formats.

**Major Revision**
- Add ResultSummary class to support result summary in excel, md, and html formats.
- Abstract RuleBase class for commonly used functions in DataDiagnosis and ResultSummary.
-
- 22 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Remove fp16 sample type conversion time for training cnn and lstm inference.
-
- 21 Mar, 2022 1 commit
-
-
Yifan Xiong authored
Add inference config for preview SKUs, including:
* [NC96ads_A100_v4](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series)
* [NV18ads_A10_v5](https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series)
-
- 17 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Remove fp16 sample type conversion time for cnn and lstm models.
-
- 16 Mar, 2022 1 commit
-
-
rafsalas19 authored
**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn
-
- 15 Mar, 2022 2 commits
-
-
user4543 authored
**Description** Fix the bug in writing results to files in mpi mode.
-
user4543 authored
**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis
-
- 09 Mar, 2022 1 commit
-
-
Yifan Xiong authored
Fix env file path to use an absolute path in `docker exec`, in case there are mixed ssh and local connections or different users are used.
-
- 07 Mar, 2022 1 commit
-
-
user4543 authored
**Description** Abstract RuleBase from DataDiagnosis.
-