- 06 Sep, 2022 2 commits
-
-
Yifan Xiong authored
* Update release note
* Update required Docker version (thanks to @chhwang)
-
Yifan Xiong authored
* Upgrade package versions
* Add release note for v0.6.0
-
- 05 Sep, 2022 2 commits
-
-
Yuting Jiang authored
**Description**
Format int type and unify np.nan in diagnosis output files.

**Major Revision**
  - Format all int columns.
  - Unify NA values to 'N/A' in json, jsonl, md, and html files.
-
Yang Wang authored
**Description**
MPI cannot be initialized twice in the same process (once from Python and once from C), and an MPI environment initialized by mpi4py cannot be reused from C. To avoid the MPI init issue introduced by mpi4py, rewrite the gen_ibstat_file function to generate the ibstat file with pssh instead.

**Major Revision**
  - Rewrite the gen_ibstat_file function to generate the ibstat file using pssh.

**Minor Revision**
  - Remove the mpi4py dependency.

Tested the topo-aware functionality on a 36-node cluster.
-
- 02 Sep, 2022 2 commits
-
-
Yifan Xiong authored
Enhance timeout cleanup to avoid possible hanging.

__Major Revisions__
* Skip postprocess (mainly torch.dist.barrier and destroy) when an exception happens (e.g., timeout, GPU crash) to avoid subprocesses hanging.
* Add cleanup to kill sb exec processes when the Ansible run fails for a certain benchmark.

__Minor Revisions__
* Update the extra Ansible timeout from 300s to 60s.
-
Yuting Jiang authored
**Description**
Make baseline check optional in data diagnosis and fix bugs.

**Major Revision**
  - Make the baseline file optional in data diagnosis.
  - Fix output bugs in md and excel format when 'function' is not in the rule.
  - Fix a bug in the multi_rules function where a missing or failed test could fail the whole process.

**Minor Revision**
  - Revise the docs related to data diagnosis.
  - Resolve the warning message in the baseline-not-found check; only raise an exception if the baseline is not found in the 'variance' function.
  - Move summary fields to the top of the json file.
  - Unify 'Index' and 'machine' to 'index' in the output file.
-
- 01 Sep, 2022 1 commit
-
-
Yuting Jiang authored
**Description**
Update error handling to support exit code of sb result diagnosis.

**Major Revision**
  - Raise an exception for any error to make exit_code=1.
-
- 31 Aug, 2022 1 commit
-
-
Yifan Xiong authored
Upgrade colorlog for [`$NO_COLOR`](https://no-color.org/) support.
-
- 26 Aug, 2022 1 commit
-
-
Yifan Xiong authored
Update apt packages in dockerfile, support sudo and ip commands.
-
- 25 Aug, 2022 2 commits
-
-
Yifan Xiong authored
Enhance parameter parsing to support string like `"--arg1 value --arg2 'a long string with spaces'"`.
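A minimal sketch of how such a string splits into arguments, using Python's standard `shlex` module. It only illustrates shell-style quoting on the example string above; treating this as the parsing approach is an assumption, not a statement about how the change is implemented.

```python
import shlex

raw = "--arg1 value --arg2 'a long string with spaces'"

# Shell-style splitting keeps the quoted value together as one token.
tokens = shlex.split(raw)

# Plain whitespace splitting breaks the quoted value into pieces.
naive = raw.split()

print(tokens)
print(len(naive))
```

The difference between `tokens` and `naive` is the case the commit message describes: a quoted value containing spaces has to survive as a single argument.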
-
Yang Wang authored
Enable the ib latency test in the ib traffic validation distributed benchmark.

Perftest supports CUDA only in BW tests (refer to the [perftest source code](https://github.com/linux-rdma/perftest/blob/23f7f8a56892bd9e00b6a4e8bd4bbfdbe122af47/src/perftest_parameters.c#L1652)), so remove the `--use_cuda` option from the cmd prefix if the test command is not bw related.
-
- 23 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description**
Add support to store values of metrics in data diagnosis.

Take the following rules as an example:

```
nccl_store_rule:
  categories: NCCL_DIS
  store: True
  metrics:
    - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
nccl_rule:
  function: multi_rules
  criteria: 'lambda label:True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values())<0.95 else False'
  categories: NCCL_DIS
```

**nccl_store_rule** stores the values of the metrics in a dict and saves them into `label["nccl_store_rule"]`; **nccl_rule** can then use those values through `label["nccl_store_rule"].values()` in its criteria.
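To make the criteria concrete, here is a small sketch that applies the same lambda expression to a hand-built `label` dict. The bandwidth numbers are made up for illustration, and the way the `label` dict is assembled here is an assumption; only the lambda body is taken from the rule above.

```python
# Same expression as the `criteria` field in the rule above.
criteria = lambda label: True if min(label["nccl_store_rule"].values()) / max(label["nccl_store_rule"].values()) < 0.95 else False

def build_label(values):
    """Build a hypothetical label dict with one stored value per run."""
    names = [f"nccl-bw:allreduce-run{i}/allreduce_1073741824_busbw" for i in range(len(values))]
    return {"nccl_store_rule": dict(zip(names, values))}

# Made-up busbw values: one run is clearly slower than the others.
unstable = build_label([230.0, 231.0, 229.0, 232.0, 210.0])
# Made-up busbw values: all runs are within a few percent of each other.
stable = build_label([230.0, 231.0, 229.0, 232.0, 230.5])

unstable_flagged = criteria(unstable)  # 210 / 232 is below 0.95
stable_flagged = criteria(stable)      # 229 / 232 is above 0.95
print(unstable_flagged, stable_flagged)
```

The rule fires when the slowest run falls more than 5% below the fastest one, so it acts as a run-to-run stability check rather than a comparison against a baseline.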
-
- 22 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description**
Add support for both jsonl and json format in data diagnosis.

**Major Revision**
  - Add support for both jsonl and json format in data diagnosis.

**Minor Revision**
  - Change related docs.
  - Add jsonl support in the cli.
-
- 17 Aug, 2022 1 commit
-
-
Yifan Xiong authored
__Description__
Update Python setup for required packages.

__Major Revisions__
* Downgrade the requests version to be compatible with Python 3.6, and add a corresponding pipeline for 3.6.
* Add an extra entry in extras_require for nested packages.
* Update `pip install` contents accordingly.
-
- 16 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description**
Downgrade the perftest submodule to v4.4-0.37 to fix a stability issue.

Issue: rdma-loopback is not stable on the public versions (v0.5/v0.6-rc1)
Docker Version: v0.6-rc1-cuda11.1
Testbed: 8 A100 40GB GPUs (1 NDv4 node)
Result: The newer perftest version introduces more variance; (max - min) / mean = 2% for v4.4-0.37 versus 8% for v4.5-0.2
-
- 13 Aug, 2022 1 commit
-
-
Yang Wang authored
An enhancement for topo-aware IB performance validation #373. This PR auto-generates the required ibstat file `ib_traffic_topo_aware_ibstat.txt`, which is used as input to build a graph.
-
- 09 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description**
Rename fields in data diagnosis to be more readable.

**Major Revision**
  - Rename fields according to the diagnosis/metric format.

**Minor Revision**
  - Change the type of diagnosis/issue_num to int.
-
- 08 Aug, 2022 1 commit
-
-
Yifan Xiong authored
Fix minimum timeout: use 60s if config is shorter.
-
- 04 Aug, 2022 1 commit
-
-
Yifan Xiong authored
* Exit gracefully on timeout, and add the corresponding log and return code.
* Set the minimum timeout to 1 minute and enlarge the Ansible timeout.
-
- 01 Aug, 2022 1 commit
-
-
Yuting Jiang authored
**Description**
Add failure check feature in data diagnosis.

**Major Revision**
  - Add a failure check rule op: if a metric_regex is not matched by any metric in the result, label it as failedtest.
  - Split performance issue and failedtest in categories.

**Minor Revision**
  - Replace DataFrame.append() with pd.concat, since append() will be removed in a later version of pandas.
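The failure check idea can be sketched with the standard `re` module: collect every pattern that matches none of the metric names in a result. The function name `find_unmatched_patterns`, the sample metric names, and the use of full-string matching are all assumptions made for illustration; the actual rule op may match differently.

```python
import re

def find_unmatched_patterns(metric_regexes, result_metrics):
    """Return the patterns that match none of the metric names in the result."""
    unmatched = []
    for pattern in metric_regexes:
        compiled = re.compile(pattern)
        # A pattern counts as matched if any metric name matches it in full.
        if not any(compiled.fullmatch(name) for name in result_metrics):
            unmatched.append(pattern)
    return unmatched

# Hypothetical metric names present in one node's result.
result_metrics = [
    "nccl-bw:allreduce/allreduce_1024_busbw",
    "kernel-launch/event_time",
]
# Hypothetical patterns the rule expects to see.
metric_regexes = [r"nccl-bw:allreduce/allreduce_\d+_busbw", r"gemm-flops/.*"]

missing = find_unmatched_patterns(metric_regexes, result_metrics)
print(missing)
```

Any pattern left in `missing` would correspond to a benchmark that produced no metrics at all, which is the case the commit labels as failedtest rather than as a performance issue.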
-
- 26 Jul, 2022 1 commit
-
-
Jie Zhang authored
* Support topo-aware IB performance validation

  Add a new pattern `topo-aware`, so the user can run the IB performance test based on the VMs' topology information. This way, the user can validate IB performance across VM pairs at different distances as a quick test instead of a pair-wise test.

  To run with the topo-aware pattern, the user needs to specify three required (and two optional) parameters in the YAML config file:

      --pattern topo-aware
      --ibstat path to ibstat output
      --ibnetdiscover path to ibnetdiscover output
      --min_dist minimum distance of VM pairs (optional, default 2)
      --max_dist maximum distance of VM pairs (optional, default 6)

  The newly added topo_aware module then parses the topology information, builds a graph, and generates the VM pairs with the specified distance (# hops). The specified IB test then runs across these generated VM pairs.

  Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add description about topology aware ib traffic tests

  Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add unit test to verify generated topology aware config file

  This commit adds a unit test to verify that the generated topology aware config file is correct. To do so, four new data files are added in order to invoke the gen_topo_aware_config function to generate a topology aware config file, which is then compared with the expected config file.

  Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Fix lint issue on Azure pipeline

  Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
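The pair generation step can be sketched as a breadth-first search over the topology graph, grouping VM pairs by hop count. This is an illustrative sketch only: the function names, the toy topology, and the choice to count every link (VM to switch, switch to switch) as one hop are assumptions, and the real topo_aware module builds its graph from ibstat and ibnetdiscover output rather than from a hand-written dict.

```python
from collections import deque

def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def pairs_by_distance(graph, vms, min_dist=2, max_dist=6):
    """Group VM pairs by hop distance, keeping only min_dist..max_dist."""
    vms = sorted(vms)
    pairs = {}
    for i, a in enumerate(vms):
        dist = hop_distances(graph, a)
        for b in vms[i + 1:]:
            d = dist.get(b)
            if d is not None and min_dist <= d <= max_dist:
                pairs.setdefault(d, []).append((a, b))
    return pairs

# Hypothetical two-level topology: vm0 and vm1 share a leaf switch, vm2 is on another.
graph = {
    "vm0": ["leaf0"], "vm1": ["leaf0"], "vm2": ["leaf1"],
    "leaf0": ["vm0", "vm1", "spine0"],
    "leaf1": ["vm2", "spine0"],
    "spine0": ["leaf0", "leaf1"],
}
pairs = pairs_by_distance(graph, ["vm0", "vm1", "vm2"])
print(pairs)
```

Testing one or a few pairs per distance bucket is what makes this a quick check compared with running every pair.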
-
- 25 Jul, 2022 1 commit
-
-
Yang Wang authored
Fix an unexpected result value (`-0.125`) issue in the ib traffic benchmark when encountering `-1` in raw output.
* Check if the value is valid before the base conversion.
* Add a test case to cover this situation.
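The `-0.125` value is what `-1` becomes when divided by 8, so a plausible reading is that a `-1` error marker went through a bits-to-bytes conversion. The sketch below validates first and converts second. The function name `convert_raw_value`, the divide-by-8 conversion, and returning `None` for invalid input are all assumptions for illustration, not the benchmark's actual code.

```python
def convert_raw_value(raw, divisor=8.0):
    """Convert a raw reading, rejecting invalid values before the division.

    Returns None for unparsable or negative input (an assumed convention).
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if value < 0:
        # A negative reading such as -1 is treated as an error marker.
        return None
    return value / divisor

# Without the check, the error marker turns into a plausible-looking number.
unchecked = -1 / 8.0

print(unchecked)
print(convert_raw_value("-1"))
print(convert_raw_value("200"))
```

Checking before converting keeps the error marker recognizable to downstream code instead of letting it look like a very small measurement.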
-
- 22 Jul, 2022 1 commit
-
-
dependabot[bot] authored
Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1.
  - [Release notes](https://github.com/terser/terser/releases)
  - [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md)
  - [Commits](https://github.com/terser/terser/commits)

---
updated-dependencies:
- dependency-name: terser
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 20 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Fix a potential port conflict caused by a race condition between time of check and time of use, by keeping the port bound throughout. Modify the function to resolve flake8 C901 while keeping the logic the same.
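A minimal sketch of the general technique, assuming a plain TCP socket on the loopback interface: ask the OS for a free port and keep the socket open, so nothing else can claim the port between the check and the use. The helper name `reserve_port` is hypothetical and this is not the project's actual function.

```python
import socket

def reserve_port(host="127.0.0.1"):
    """Bind to an OS-chosen free port and return the still-open socket and port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 lets the OS pick a free port
    return sock, sock.getsockname()[1]

held_socket, port = reserve_port()

# While the first socket stays bound, a second bind to the same port fails.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    probe.bind(("127.0.0.1", port))
    conflict_detected = False
except OSError:
    conflict_detected = True
finally:
    probe.close()

held_socket.close()
print(port, conflict_detected)
```

The race arises when a helper closes the socket right after finding a free port and returns only the number, since another process can then bind that port before the caller does.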
-
- 13 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Add dependencies
* include ndv4-topo.xml in cuda docker images
* require requests version to avoid RequestsDependencyWarning
-
- 09 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Fix several issues in the ib validation benchmark:
* continue running when a timeout occurs midway, instead of aborting the whole mpi process
* make the timeout parameter configurable, with a default of 120 seconds
* avoid mixing stdio and iostream when printing to stdout
* set the default message size to 8M, which will saturate ib in most cases
* fix the hostfile path issue so that it can be found automatically in different cases
-
- 08 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in both 1 node and all nodes in one config by changing `node_num`. Update docs and add test case accordingly.
-
- 06 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Update dependencies and Dockerfile:
* upgrade nccl-tests and rccl-tests to the current latest version to match NCCL/RCCL versions
* unify image tag names on DockerHub
* remove verbose output in the Dockerfile and fix some flags
-
- 05 Jul, 2022 1 commit
-
-
Yifan Xiong authored
Support SKU auto-detection and use the corresponding benchmark config when running on an Azure VM.
-
- 29 Jun, 2022 2 commits
-
-
Yifan Xiong authored
Fix several issues in the ib loopback benchmark:
* use `--report_gbits` and divide by 8 to get GB/s; previous results were MiB/s / 1000
* use the ib_write_bw binary built in third_party instead of the one on the system path
* update the metric names so that different hca indices have the same metric
-
Yifan Xiong authored
Refine error message when GPU is not detected.

Possible solutions if hardware exists and drivers are already installed:

* nvidia gpus:
  ```sh
  /sbin/modprobe nvidia-uvm
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
  ```
* amd gpus:
  ```sh
  modprobe amdgpu
  ```
-
- 24 Jun, 2022 2 commits
-
-
Yifan Xiong authored
Fix incorrect ulimit nofile config in Dockerfile. The Dockerfile runs commands with sh rather than bash by default, and its `echo` does not accept options, so the literal `-e` was written into /etc/security/limits.conf.
-
Yifan Xiong authored
**Description**
Support multiple IB/GPU devices running simultaneously in the ib validation benchmark.

**Major Revisions**
  - Revise ib_validation_performance.cc so that multiple processes per node can be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel.
  - Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
  - Fix env issues in Dockerfile for end-to-end test.
  - Update ib-traffic configuration examples in config files.
  - Update unit tests and docs accordingly.

Closes #326.
-
- 19 Jun, 2022 2 commits
-
-
Yifan Xiong authored
Fix sudo issue when running without Docker; the user account could be arbitrary in that case.
-
Yifan Xiong authored
**Description**
Update ROCm Dockerfile.

**Major Revisions**
  - Add dockerfile for ROCm 5.1.3
  - Merge 5.1.x and 5.0.x dockerfile
  - Remove 4.2 and 4.0 legacy
  - Update build pipeline accordingly
-
- 15 Jun, 2022 1 commit
-
-
Yifan Xiong authored
**Description**
Fix cmake and build issues.

**Major Revision**
* Remove unnecessary boost build
* Remove user-agent for mlc
* Remove -j for third party to build each project in sequence
* Fix ansible collections installation path
-
- 14 Jun, 2022 1 commit
-
-
Yifan Xiong authored
**Description**
Support `sb run` on host directly without Docker.

**Major Revisions**
  - Add `--no-docker` argument for `sb run`.
  - Run on host directly if `--no-docker` is specified.
  - Update docs and tests correspondingly.
-
- 06 Jun, 2022 1 commit
-
-
dependabot[bot] authored
Bumps [eventsource](https://github.com/EventSource/eventsource) from 1.1.0 to 1.1.1.
  - [Release notes](https://github.com/EventSource/eventsource/releases)
  - [Changelog](https://github.com/EventSource/eventsource/blob/master/HISTORY.md)
  - [Commits](https://github.com/EventSource/eventsource/compare/v1.1.0...v1.1.1)

---
updated-dependencies:
- dependency-name: eventsource
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 02 Jun, 2022 2 commits
-
-
dependabot[bot] authored
Bumps [cross-fetch](https://github.com/lquixada/cross-fetch) from 3.1.4 to 3.1.5.
  - [Release notes](https://github.com/lquixada/cross-fetch/releases)
  - [Commits](https://github.com/lquixada/cross-fetch/compare/v3.1.4...v3.1.5)

---
updated-dependencies:
- dependency-name: cross-fetch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
dependabot[bot] authored
Bumps [async](https://github.com/caolan/async) from 2.6.3 to 2.6.4.
  - [Release notes](https://github.com/caolan/async/releases)
  - [Changelog](https://github.com/caolan/async/blob/v2.6.4/CHANGELOG.md)
  - [Commits](https://github.com/caolan/async/compare/v2.6.3...v2.6.4)

---
updated-dependencies:
- dependency-name: async
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-