- 10 Dec, 2021 3 commits
-
-
guoshzhao authored
**Description** Integrate monitor into SuperBench.
**Major Revision**
  - Initialize, start and stop monitor in SB executor.
  - Parse the monitor data in SB runner and merge it into benchmark results.
  - Specify ReduceType for monitor metrics, such as MAX, MIN and LAST.
  - Add monitor configs into config file.
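A minimal sketch of what a reduce type does to monitor samples. The enum member names follow the commit message (MAX, MIN, LAST); the `reduce_samples` helper and the sample values are hypothetical and are not taken from the SuperBench code.

```python
from enum import Enum


class ReduceType(Enum):
    """Illustrative reduce types; the names follow the commit message."""
    MAX = 'max'
    MIN = 'min'
    LAST = 'last'


def reduce_samples(samples, reduce_type):
    """Collapse periodically captured values into one number (hypothetical helper)."""
    if not samples:
        raise ValueError('no samples to reduce')
    if reduce_type is ReduceType.MAX:
        return max(samples)
    if reduce_type is ReduceType.MIN:
        return min(samples)
    if reduce_type is ReduceType.LAST:
        return samples[-1]
    raise ValueError(f'unsupported reduce type: {reduce_type}')


# Made-up temperature samples captured during one benchmark run.
gpu_temperature = [61, 74, 70]
peak = reduce_samples(gpu_temperature, ReduceType.MAX)
final = reduce_samples(gpu_temperature, ReduceType.LAST)
```

Under these assumptions, a metric such as temperature would plausibly use MAX, while a counter that only grows would use LAST.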
-
guoshzhao authored
**Description** Set the `reduce_op` type for metric `return_code` to `None`.
-
Yuting Jiang authored
**Description** Add CLI to integrate the data diagnosis module.
-
- 09 Dec, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Unify metric names of benchmarks.
-
- 08 Dec, 2021 2 commits
-
-
Yuting Jiang authored
**Description** Add data diagnosis module.
**Major Revision**
  - Add DataDiagnosis class to support rule-based data diagnosis for the result summary jsonl file of multiple nodes.
  - Add RuleOp class to define rule operators.
-
Yifan Xiong authored
Fix issues for distributed runs:
  * fix config for memory bandwidth benchmarks
  * add throttling for high concurrency docker pull
  * update rsync path and exclude directories
  * handle exceptions when creating summary
  * tune for logging
-
- 07 Dec, 2021 1 commit
-
-
guoshzhao authored
**Description** Add return_code metric into result and revise unit tests.
-
- 03 Dec, 2021 1 commit
-
-
Yifan Xiong authored
Add config file for Azure NDm A100 v4 SKU.
-
- 02 Dec, 2021 3 commits
-
-
guoshzhao authored
**Description** Add gpt-small into config files.
-
guoshzhao authored
**Description** If `ignore_invalid` is True and the 'required' arguments are not set when registering the benchmark, the arguments should be provided by the user in the config, and argument checking is skipped.
-
Yifan Xiong authored
**Description** Replace `-c` argument with `-N` for `numactl` since the old `-c`/`--cpubind` argument is deprecated.
-
- 18 Nov, 2021 1 commit
-
-
guoshzhao authored
**Description** Add the initial version of Monitor.
**Major Revision**
  - Add `Monitor` class to launch a background process for monitoring.
  - Add `MonitorRecord` class to save the data from a single capture.
-
- 15 Nov, 2021 1 commit
-
-
guoshzhao authored
**Description** Rename the `nvidia_helper` utility to the `device_manager` module and support more functions:
```
device_manager.get_device_count()
device_manager.get_device_utilization(idx)
device_manager.get_device_temperature(idx)
device_manager.get_device_power_limit(idx)
device_manager.get_device_memory(idx)
device_manager.get_device_row_remapped_info(idx)
device_manager.get_device_ecc_error(idx)
```
-
- 12 Nov, 2021 1 commit
-
-
Yifan Xiong authored
__Description__ Add TensorRT inference benchmark for torchvision models.
__Major Revision__
  - Measure TensorRT inference performance.
-
- 09 Nov, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add IB traffic validation distributed benchmark.
**Major Revision**
  - Add IB traffic validation distributed benchmark, example and test.
-
- 30 Oct, 2021 1 commit
-
-
Ziyue Yang authored
**Description** This commit does the following: 1) Adds CPU-initiated copy benchmark; 2) Adds dtod benchmark; 3) Supports scanning NUMA nodes and GPUs inside the benchmark program; 4) Renames gpu-sm-copy to gpu-copy.
-
- 27 Oct, 2021 1 commit
-
-
guoshzhao authored
Add RocmOnnxModelBenchmark class to run benchmarks packaged in superbench/benchmark:rocm4.3.1-onnxruntime1.9.0
-
- 22 Oct, 2021 2 commits
-
-
Yuting Jiang authored
**Description** Add gpcnet microbenchmark.
**Major Revision**
  - Add two microbenchmarks for gpcnet: gpc-network-test and gpc-network-load-test.
  - Add related test and example file.
-
guoshzhao authored
**Description** Add CudaDockerBenchmark and RocmDockerBenchmark to support the AMD and CUDA platforms for DockerBenchmark.
-
- 21 Oct, 2021 2 commits
-
-
guoshzhao authored
**Description** Rename all occurrences of the term `onnx` to `onnxruntime`.
-
Yuting Jiang authored
**Description** Add IB validation tool source code. The IB validation tool flexibly validates IB traffic of different patterns across multiple nodes.
**Major Revision**
  - Add IB validation tool source code.
  - Add cmake file to build the source code.
-
- 12 Oct, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add TCP connectivity validation microbenchmark, which validates TCP connectivity between the current node and several nodes in the hostfile.
**Major Revision**
  - Add TCP connectivity validation microbenchmark and related test, example.
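A minimal sketch of a TCP connectivity check using only the Python standard library. The helper names `can_connect` and `validate_hosts` are hypothetical, and this is not the implementation used by the microbenchmark; it only illustrates probing a list of hosts from the current node.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def validate_hosts(hosts, port, timeout=1.0):
    """Map each host to whether it accepted a TCP connection (hypothetical helper)."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

A caller could read host names from a hostfile (assumed here to hold one host per line) and pass them to `validate_hosts` together with the port to probe.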
-
- 28 Sep, 2021 1 commit
-
-
guoshzhao authored
**Description** Fix typo when setting the `force_fp32` option.
-
- 27 Sep, 2021 1 commit
-
-
guoshzhao authored
**Description** Add option `force_fp32` to use fp32 instead of tf32; it only takes effect on Ampere or newer GPUs.
-
- 26 Sep, 2021 1 commit
-
-
Yifan Xiong authored
**Description** Cherry-pick bug fixes from v0.3.0 to main.
**Major Revisions**
  * Docs - Upgrade version and release note (#209)
  * Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
  * Benchmarks: Update - Update benchmarks in configuration file (#208)
  * CI/CD - Update GitHub Action VM (#211)
  * Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
  * CI/CD - Fix bug in build image for push event (#205)
  * Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
  * Tool: Fix bug - Fix function naming issue in system info (#200)
  * CI/CD - Push images in GitHub Action (#202)
  * Bug - Fix torch.distributed command for single node (#201)
  * CLI - Integrate system info for node (#199)
  * Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
  * CI/CD - Add ROCm image build in GitHub Actions (#194)
  * Bug: Fix bug - fix bug of hipBusBandwidth build (#193)
  * Benchmarks: Build Pipeline - Restore rocblas build logic (#197)
  * Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
  * Bug - Revise 'docker run' in sb deploy (#195)
  * Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190)
Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
-
- 06 Sep, 2021 1 commit
-
-
Yuting Jiang authored
**Description** Add script to generate system config info.
**Major Revision**
  - Add script to generate system config info into the dict in superbench/tools.
-
- 03 Sep, 2021 1 commit
-
-
Yuting Jiang authored
Benchmarks: Code Revision - Revise arguments of nccl/rccl to support mpi mode and rename metric (#189)
**Description** Revise arguments of nccl/rccl to support MPI mode (MPI cannot run in nccl/rccl when multiple operators run in sequence without a barrier) and rename the metric.
**Major Revision**
  - Revise argument operators to be a single one.
**Minor Revision**
  - Rename metric to remove benchmark name info.
  - Change the default value of argument ngpus to 1.
-
- 02 Sep, 2021 3 commits
-
-
Ziyue Yang authored
**Description** This commit fixes the error caused by the missing key 'percentile' when parsing the FIO result.
-
Yuting Jiang authored
Benchmarks: Add Configuration - Add microbenchmark in the validation config file for HPE (AMD MI100) (#176)
**Description** Add microbenchmark in the validation config file for AMD MI100.
**Major Revision**
  - Add rccl-bw, mem-bw, ib-loopback, gemm-flops, kernel-launch config for MI100.
-
Yifan Xiong authored
__Description__ Fix inventory bug in ansible_runner when the host list is provided with multiple hosts. This ought to be handled by the ansible_runner library; as a workaround, use the `--inventory` argument on the command line.
-
- 01 Sep, 2021 2 commits
-
-
guoshzhao authored
**Description** Revise the DockerBenchmark base to support image pull, image rm etc.
**Major Revision**
  - image pull in _preprocess()
  - image clean in _postprocess()
  - execute customized commands in _benchmark()
  - add unit tests
-
guoshzhao authored
**Description** Set up docker environment in docker container.
**Major Revision**
  - Install docker client for cuda and rocm images.
  - Mount /var/run/docker.sock from host.
-
- 31 Aug, 2021 2 commits
-
-
Ziyue Yang authored
Benchmarks: Code Revision - Revise metric name generation and default config for disk performance benchmark (#175) **Description** This commit revises disk performance benchmark, including: 1) Add missing benchmark name in default config; 2) Avoid using reserved character ':' in metric name.
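A minimal sketch of keeping the reserved character ':' out of a metric name. The helper `build_metric_name`, the underscore replacement, and the example name parts are all hypothetical; the commit message does not say which replacement the benchmark actually uses.

```python
def build_metric_name(*parts, reserved=':', replacement='_'):
    """Join name parts with underscores and replace the reserved character (hypothetical helper)."""
    name = '_'.join(str(part) for part in parts)
    return name.replace(reserved, replacement)


# Made-up name parts; the last one contains the reserved ':' character.
metric = build_metric_name('disk', 'read', 'lat_ns:95.0')
```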
-
guoshzhao authored
**Description** Package frequently used subprocess invocations into a function.
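A minimal sketch of such a wrapper around `subprocess.run`. The function name `run_command` and its choice of defaults (shell mode, stderr merged into stdout, text output) are assumptions for illustration, not the project's actual helper.

```python
import subprocess


def run_command(command, timeout=None):
    """Run a shell command and return the completed process with captured text output (hypothetical helper)."""
    return subprocess.run(
        command,
        shell=True,                # the command is passed as a single shell string
        stdout=subprocess.PIPE,    # capture standard output
        stderr=subprocess.STDOUT,  # merge standard error into the captured output
        universal_newlines=True,   # decode output as text
        check=False,               # leave return-code handling to the caller
        timeout=timeout,
    )


result = run_command('echo hello')
```

Callers can then inspect `result.returncode` and `result.stdout` instead of repeating the same `subprocess` arguments at every call site.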
-
- 30 Aug, 2021 4 commits
-
-
Ziyue Yang authored
**Description** This commit adds gpu_sm_copy benchmark and related tests.
-
Yuting Jiang authored
**Description** Remove IB device port info in command to fix bug of IB loopback.
**Major Revision**
  - Remove IB device port info in command to fix bug of IB loopback.
-
Yuting Jiang authored
**Description** Add gemm flops microbenchmark for amd.
**Major Revision**
  - Add gemm flops microbenchmark for amd.
  - Add related example and test file.
-
Yuting Jiang authored
**Description** Extract base class for gemm flops microbenchmark.
**Major Revision**
  - Extract base class for gemm flops microbenchmark and add related test.
  - Revise gemm_flops_performance for cuda.
-
- 27 Aug, 2021 2 commits
-
-
guoshzhao authored
**Description** Rename `kernel_launch_overhead_event` to `event_overhead`, `kernel_launch_overhead_wall` to `wall_overhead`.
-
Yuting Jiang authored
**Description** Add memory bus bandwidth performance microbenchmark for amd.
**Major Revision**
  - Add memory bus bandwidth performance microbenchmark for amd.
  - Add related example and test file.
-