- 01 Sep, 2022 1 commit
Yuting Jiang authored

**Description**
Update error handling to support the exit code of `sb` result diagnosis.

**Major Revision**
- Raise an exception on any error so that the exit code is set to 1.
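The error-to-exit-code pattern described above can be sketched as follows; `run_diagnosis` is a hypothetical stand-in for the actual diagnosis entry point, not the real API:

```python
import sys

def run_diagnosis(results):
    # Hypothetical stand-in for the sb result diagnosis entry point.
    if not results:
        raise ValueError('no results to diagnose')
    return {'issue_num': 0}

def main(results):
    # Any error raised during diagnosis is turned into exit code 1,
    # so callers and CI pipelines can detect failures reliably.
    try:
        run_diagnosis(results)
    except Exception as exc:
        print(f'diagnosis failed: {exc}', file=sys.stderr)
        return 1
    return 0
```

With this shape, scripts invoking the tool can branch on the process exit code alone.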
- 25 Aug, 2022 2 commits
Yifan Xiong authored

Enhance parameter parsing to support strings like `"--arg1 value --arg2 'a long string with spaces'"`.
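Quote-aware splitting like this can be done with Python's standard `shlex` module; this is a minimal sketch of the behavior, not necessarily the project's actual implementation:

```python
import shlex

def parse_extra_args(arg_string):
    # shlex honors shell-style quoting, so a quoted value such as
    # 'a long string with spaces' survives as a single argument.
    return shlex.split(arg_string)
```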
Yang Wang authored

Enable the IB latency test in the IB traffic validation distributed benchmark. Since perftest supports CUDA only in bandwidth tests (refer to the [perftest source code](https://github.com/linux-rdma/perftest/blob/23f7f8a56892bd9e00b6a4e8bd4bbfdbe122af47/src/perftest_parameters.c#L1652)), remove the `--use_cuda` option from the command prefix unless the test command is bandwidth related.
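A minimal sketch of that option filtering; the function name and the `_bw` substring heuristic are illustrative assumptions, not the benchmark's exact code:

```python
def build_command(cmd_prefix, test_cmd):
    # CUDA is supported only in bandwidth tests (e.g. ib_write_bw), so
    # keep --use_cuda only when the test binary is bandwidth related.
    if '_bw' not in test_cmd:
        cmd_prefix = [tok for tok in cmd_prefix if not tok.startswith('--use_cuda')]
    return cmd_prefix + [test_cmd]
```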
- 23 Aug, 2022 1 commit
Yuting Jiang authored

**Description**
Add support for storing metric values in data diagnosis. Take the following rules as an example:

```
nccl_store_rule:
  categories: NCCL_DIS
  store: True
  metrics:
    - nccl-bw:allreduce-run0/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run1/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run2/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run3/allreduce_1073741824_busbw
    - nccl-bw:allreduce-run4/allreduce_1073741824_busbw
nccl_rule:
  function: multi_rules
  criteria: 'lambda label: True if min(label["nccl_store_rule"].values())/max(label["nccl_store_rule"].values()) < 0.95 else False'
  categories: NCCL_DIS
```

**nccl_store_rule** stores the values of the listed metrics in a dict saved as `label["nccl_store_rule"]`, and **nccl_rule** can then use those values through `label["nccl_store_rule"].values()` in its criteria.
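The criteria lambda above amounts to a min/max stability check across runs. A small Python sketch of the same logic (the run values below are made up for illustration):

```python
def variance_check(stored, threshold=0.95):
    # Mirror of the criteria lambda: flag the node when min/max across
    # the stored runs drops below the threshold, i.e. the allreduce
    # bus bandwidth is unstable between runs.
    values = list(stored.values())
    return min(values) / max(values) < threshold

# Hypothetical label produced by nccl_store_rule (values are made up).
label = {'nccl_store_rule': {
    'run0': 180.0, 'run1': 181.5, 'run2': 120.0, 'run3': 179.8, 'run4': 180.2,
}}
```

Here run2 is an outlier, so `variance_check(label['nccl_store_rule'])` flags the node.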
- 22 Aug, 2022 1 commit
Yuting Jiang authored

**Description**
Add support for both JSONL and JSON formats in data diagnosis.

**Major Revision**
- Add support for both JSONL and JSON formats in data diagnosis.

**Minor Revision**
- Update the related docs.
- Add JSONL support in the CLI.
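A format-agnostic loader for this kind of dual JSON/JSONL support could look like the following; this is a sketch under the assumption that each JSONL line holds one result object, not the tool's actual code:

```python
import json

def load_results(text):
    # Accept either a single JSON document or JSON Lines
    # (one JSON object per line).
    stripped = text.strip()
    try:
        data = json.loads(stripped)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to JSONL: parse each non-empty line separately.
        return [json.loads(line) for line in stripped.splitlines() if line.strip()]
```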
- 09 Aug, 2022 1 commit
Yuting Jiang authored

**Description**
Rename fields in data diagnosis to be more readable.

**Major Revision**
- Rename fields according to the diagnosis/metric format.

**Minor Revision**
- Change the type of diagnosis/issue_num to int.
- 04 Aug, 2022 1 commit
Yifan Xiong authored

* Gracefully exit on timeout, with a corresponding log message and return code.
* Set the minimum timeout to 1 minute and enlarge the Ansible timeout.
- 01 Aug, 2022 1 commit
Yuting Jiang authored

**Description**
Add a failure check feature in data diagnosis.

**Major Revision**
- Add a failure check rule op: if a metric_regex is not matched by any metric in the result, the node is labeled as failedtest.
- Split performance issues and failedtest into separate categories.

**Minor Revision**
- Replace `DataFrame.append()` with `pd.concat()`, since `append()` will be removed in a later version of pandas.
- 26 Jul, 2022 1 commit
Jie Zhang authored

* Support topo-aware IB performance validation. Add a new pattern `topo-aware`, so the user can run IB performance tests based on the VMs' topology information. This way, the user can validate IB performance across VM pairs at different distances as a quick test instead of a pair-wise test. To run with the topo-aware pattern, the user needs to specify three required (and two optional) parameters in the YAML config file:
  - `--pattern topo-aware`
  - `--ibstat` path to ibstat output
  - `--ibnetdiscover` path to ibnetdiscover output
  - `--min_dist` minimum distance of VM pairs (optional, default 2)
  - `--max_dist` maximum distance of VM pairs (optional, default 6)

  The newly added topo_aware module then parses the topology information, builds a graph, and generates the VM pairs with the specified distance (number of hops). The specified IB test will then run across these generated VM pairs.
* Add a description of topology-aware IB traffic tests.
* Add a unit test to verify the generated topology-aware config file. Four new data files are added to invoke the gen_topo_aware_config function, generate a topology-aware config file, and compare it with the expected config file.
* Fix a lint issue on the Azure pipeline.

Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
- 25 Jul, 2022 1 commit
Yang Wang authored

Fix an unexpected result value (`-0.125`) in the IB traffic benchmark when encountering `-1` in the raw output:
* Check that the value is valid before the base conversion.
* Add a test case to cover this situation.
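The `-0.125` is what `-1` becomes after a divide-by-8 unit conversion. A minimal sketch of the guard, assuming `-1` marks a failed measurement and the conversion is Gb/s to GB/s (the function name is illustrative):

```python
def parse_bandwidth(raw_value):
    # perftest prints -1 for a failed measurement; guard before the
    # Gb/s -> GB/s conversion, otherwise -1 becomes -0.125 (-1 / 8).
    value = float(raw_value)
    if value < 0:
        return None
    return value / 8
```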
- 20 Jul, 2022 1 commit
Yifan Xiong authored

Fix a potential port conflict caused by a race condition between time-of-check and time-of-use, by keeping the port bound throughout. Also modify the function to resolve flake8 C901 while keeping the logic the same.
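The fix avoids the classic check-then-use race: instead of probing a port, closing the socket, and reusing the number, the socket stays bound. A sketch of the pattern (`reserve_port` is an illustrative name, not the project's API):

```python
import socket

def reserve_port(host='127.0.0.1'):
    # Bind once and keep the socket open for as long as the port is
    # needed, instead of check-then-close-then-reuse, which leaves a
    # window in which another process can grab the port (TOCTOU race).
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0: let the OS pick a free port
    return sock, sock.getsockname()[1]
```

The caller closes the socket only when handing the port to the process that will actually listen on it.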
- 09 Jul, 2022 1 commit
Yifan Xiong authored

Fix several issues in the IB validation benchmark:
* Continue running when a timeout occurs in the middle, instead of aborting the whole MPI process.
* Make the timeout parameter configurable, with a default of 120 seconds.
* Avoid mixing stdio and iostream when printing to stdout.
* Set the default message size to 8M, which saturates IB in most cases.
* Fix the hostfile path issue so that the file can be found automatically in different cases.
- 08 Jul, 2022 1 commit
Yifan Xiong authored

Support `node_num: 1` in MPI mode, so that MPI benchmarks can run on both one node and all nodes with a single config by changing `node_num`. Update the docs and add a test case accordingly.
- 05 Jul, 2022 1 commit
Yifan Xiong authored

Support SKU auto-detection and use the corresponding benchmark config when running on an Azure VM.
- 29 Jun, 2022 1 commit
Yifan Xiong authored

Fix several issues in the IB loopback benchmark:
* Use `--report_gbits` and divide by 8 to get GB/s; previous results were MiB/s / 1000.
* Use the ib_write_bw binary built in third_party instead of the one in the system path.
* Update the metric names so that different HCA indices share the same metric.
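The unit fix in the first bullet reduces to one conversion; a tiny sketch for clarity (function name is illustrative):

```python
def gbits_to_gbytes(bandwidth_gbits):
    # With --report_gbits, perftest reports bandwidth in Gb/s; dividing
    # by 8 converts it to GB/s. The previous parsing took MiB/s and
    # divided by 1000, which yields neither MB/s nor GB/s.
    return bandwidth_gbits / 8
```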
- 24 Jun, 2022 1 commit
Yifan Xiong authored

**Description**
Support running multiple IB/GPU devices simultaneously in the IB validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node can launch multiple perftest commands simultaneously. For each node pair in the config, the processes on each node run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in the Dockerfile for the end-to-end test.
- Update ib-traffic configuration examples in the config files.
- Update unit tests and docs accordingly.

Closes #326.
- 14 Jun, 2022 1 commit
Yifan Xiong authored

**Description**
Support `sb run` directly on the host without Docker.

**Major Revisions**
- Add a `--no-docker` argument for `sb run`.
- Run directly on the host if `--no-docker` is specified.
- Update docs and tests correspondingly.
- 01 Jun, 2022 1 commit
user4543 authored

**Description**
Fix bugs in data diagnosis.

**Major Revision**
- Add support for getting the baseline of a metric that uses custom benchmark naming with ':', like 'nccl-bw:default/allreduce_8_bw:0'.
- Save the raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True.
- Fix a bug of using the wrong column index when applying formatting (red color and percentile) in the Excel output.
- 29 Apr, 2022 1 commit
Yifan Xiong authored

**Description**
Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**
* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
- 11 Apr, 2022 2 commits
guoshzhao authored

**Description**
Integrate FAMBench into SuperBench based on the Docker implementation: https://github.com/facebookresearch/FAMBench. The script that runs all benchmarks is https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh
user4543 authored

**Description**
Integrate the output of all nodes' diagnosis results.
- 10 Apr, 2022 1 commit
user4543 authored

**Description**
Output the results of all nodes in data diagnosis.
- 08 Apr, 2022 1 commit
user4543 authored

**Description**
Integrate result summary and update the output format of data diagnosis.

**Major Revision**
- Integrate result summary.
- Add Markdown and HTML formats for data diagnosis.
- 01 Apr, 2022 1 commit
guoshzhao authored

**Description**
Use the config `log_raw_data` to control whether to log the raw data into a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save their raw data into a file.
- 24 Mar, 2022 1 commit
user4543 authored

**Description**
Add result summary in Excel, MD, and HTML formats.

**Major Revision**
- Add a ResultSummary class to support result summaries in Excel, MD, and HTML formats.
- Abstract a RuleBase class for the functions shared by DataDiagnosis and ResultSummary.
- 16 Mar, 2022 1 commit
rafsalas19 authored

**Description**
Add GPU-Burn to SuperBench:
- Added a third-party submodule.
- Modified the Makefile to build the gpu-burn binary.
- Added/modified microbenchmarks to add the gpu-burn Python scripts.
- Modified the default and azure_ndv4 configs to add gpu-burn.
- 15 Mar, 2022 1 commit
user4543 authored

**Description**
Add MD and HTML output formats for DataDiagnosis.

**Major Revision**
- Add MD and HTML support in file_handler.
- Add an interface in DataDiagnosis for MD and HTML output.

**Minor Revision**
- Move the Excel and JSON output interfaces into DataDiagnosis.
- 07 Mar, 2022 1 commit
user4543 authored

**Description**
Abstract RuleBase from DataDiagnosis.
- 20 Feb, 2022 1 commit
user4543 authored

**Description**
Add a multi-rules feature for data diagnosis to support combined checks across multiple rules.

**Major Revision**
- Revise the rule design to support checks that combine multiple rules.
- Update related code and tests.
- 09 Feb, 2022 1 commit
Ziyue Yang authored

**Description**
This commit removes NUMA binding for device-to-device tests, because NUMA does not affect their performance, and revises the benchmark metrics accordingly.
- 08 Feb, 2022 1 commit
Ziyue Yang authored

This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
- 07 Feb, 2022 1 commit
Ziyue Yang authored

**Description**
This commit optimizes result variance in the gpu_copy benchmark:
1) Add a warmup phase to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking an option that is not recommended in performance tests;
4) Enlarge the message size in the performance benchmark.
- 29 Jan, 2022 2 commits
Yifan Xiong authored

Support T4 and A10 in the GEMM benchmark.
Yifan Xiong authored

Support customized env for all modes in the configuration.
- 28 Jan, 2022 2 commits
guoshzhao authored

**Description**
Sync E2E training results among all workers and clean up metric naming.

**Major Revision**
- Sync (do allreduce max on) the E2E training results among all workers.
- Avoid using ':0' in the metric name when only one rank has output.
guoshzhao authored

**Description**
Add a timeout feature for each benchmark.

**Major Revision**
- Add a `timeout` config for each benchmark. In the current config files, only kernel-launch sets the timeout, as an example; other benchmarks can be configured in the future.
- Set the timeout config for `ansible_runner.run()`. On timeout the runner gets return code 254: `[ansible.py:80][WARNING] Run failed, return code 254.`
- Use the `timeout` command to terminate the client process.
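A minimal Python sketch of the same idea (the real runner wraps the remote command with the `timeout` utility; `run_with_timeout` and the mapping to 254 here are illustrative):

```python
import subprocess

def run_with_timeout(cmd, timeout_seconds):
    # Kill the benchmark process when it overruns and surface a distinct
    # return code, similar to wrapping the command with coreutils `timeout`.
    # 254 mirrors the runner's "Run failed, return code 254" warning.
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        return 254
```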
- 25 Jan, 2022 1 commit
Yifan Xiong authored

__Description__
Update benchmark naming to support annotations.

__Major Revisions__
- Update the name handling for `create_benchmark_context` in the executor.
- Keep backward compatibility for model benchmarks using the "_models" suffix.
- Update documents.
- 21 Jan, 2022 1 commit
Ziyue Yang authored

**Description**
This commit adds bidirectional tests to the gpu_copy benchmark for both device-host and device-device transfers, and revises the related tests.
- 19 Jan, 2022 1 commit
guoshzhao authored

**Description**
Add 50th, 90th, 95th, 99th, and 99.9th percentile latency metrics for the ORT and PyTorch inference benchmarks.
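These percentile metrics can be computed with a simple nearest-rank scheme over the recorded latency samples; this is a sketch of the idea, not the benchmarks' exact implementation (the metric names are illustrative):

```python
def latency_percentiles(samples):
    # Nearest-rank percentile over sorted latency samples; sketches how
    # 50th/90th/95th/99th/99.9th metrics can be derived from raw timings.
    ordered = sorted(samples)
    metrics = {}
    for p in (50, 90, 95, 99, 99.9):
        index = min(len(ordered) - 1, int(len(ordered) * p / 100))
        metrics[f'latency_p{p}'] = ordered[index]
    return metrics
```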
- 18 Jan, 2022 1 commit
Yifan Xiong authored

__Description__
Add commands `sb benchmark list` and `sb benchmark list-parameters` to support listing all optional parameters for benchmarks.

<details>
<summary>Examples</summary>
<pre>
$ sb benchmark list -n [a-z]+-bw -o table
Result
--------
mem-bw
nccl-bw
rccl-bw
</pre>
<pre>
$ sb benchmark list-parameters -n mem-bw
=== mem-bw ===

optional arguments:
  --bin_dir str             Specify the directory of the benchmark binary.
  --duration int            The elapsed time of benchmark in seconds.
  --mem_type str [str ...]  Memory types to benchmark. E.g. htod dtoh dtod.
  --memory str              Memory argument for bandwidthtest. E.g. pinned unpinned.
  --run_count int           The run count of benchmark.
  --shmoo_mode              Enable shmoo mode for bandwidthtest.

default values:
  {'bin_dir': None, 'duration': 0, 'mem_type': ['htod', 'dtoh'], 'memory': 'pinned', 'run_count': 1}
</pre>
</details>

__Major Revisions__
* Add `sb benchmark list` to list the benchmarks matching a given name.
* Add `sb benchmark list-parameters` to list the parameters of benchmarks matching a given name.

__Minor Revisions__
* Sort and format the help text for argparse.