Commits · a94ead34b002ceb03c393d104f0aa9e7922d4cab · tsoc / superbenchmark

05 Jul, 2022 1 commit
- CLI - Support SKU auto detect if running on Azure VM (#365) · a94ead34
  Yifan Xiong authored Jul 05, 2022
```
Support SKU auto detect and using corresponding benchmark config if running on Azure VM.
```
  a94ead34
29 Jun, 2022 2 commits

Fix issues in ib loopback benchmark (#369) · 620192a2

Yifan Xiong authored Jun 30, 2022

Fix several issues in the ib loopback benchmark:
* use `--report_gbits` and divide by 8 to get GB/s; previous results were
  MiB/s / 1000
* use the `ib_write_bw` binary built in third_party instead of the one on the system path
* update the metric names so that different HCA indices share the same metric
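
The unit fix in the first bullet comes down to simple arithmetic. A minimal sketch follows; the function name is hypothetical and is not taken from the benchmark code.

```python
def gbits_to_gbytes_per_sec(gbits_per_sec: float) -> float:
    """Convert a bandwidth in Gbit/s (as printed with --report_gbits) to GB/s."""
    # 8 bits per byte, so Gbit/s divided by 8 gives GB/s.
    return gbits_per_sec / 8.0


# Example: a 200 Gbit/s reading corresponds to 25 GB/s.
print(gbits_to_gbytes_per_sec(200.0))  # 25.0
```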

620192a2

Deployment - Refine error message when GPU is not detected (#368) · 8ef7163a

Yifan Xiong authored Jun 30, 2022

Refine error message when GPU is not detected.

Possible solutions if hardware exists and drivers are already installed:
* nvidia gpus:
  ```sh
  /sbin/modprobe nvidia-uvm
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
  ```

* amd gpus
  ```sh
  modprobe amdgpu
  ```

8ef7163a

24 Jun, 2022 1 commit

Support multiple IB/GPU in ib validation (#363) · bfaa1c83

Yifan Xiong authored Jun 24, 2022

**Description**

Support running multiple IB/GPU devices simultaneously in the ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node can be used to launch multiple perftest commands simultaneously. For each node pair in the config, the configured number of processes per node will run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.

bfaa1c83

19 Jun, 2022 1 commit
- Runner - Fix sudo issue when running without Docker (#362) · 0f7b057a
  Yifan Xiong authored Jun 19, 2022
```
Fix sudo issue when running without Docker; the user account could be
arbitrary in that case.
```
  0f7b057a
15 Jun, 2022 1 commit

Fix cmake and build issues (#360) · 60a3c743

Yifan Xiong authored Jun 15, 2022

**Description**

Fix cmake and build issues.

**Major Revision**

* Remove unnecessary boost build
* Remove user-agent for mlc
* Remove -j for third party to build each project in sequence
* Fix ansible collections installation path

60a3c743

14 Jun, 2022 1 commit

Support `sb run` on host directly without Docker (#358) · a4937e95

Yifan Xiong authored Jun 14, 2022

**Description**

Support `sb run` on host directly without Docker

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` is specified.
- Update docs and tests correspondingly.

a4937e95

01 Jun, 2022 1 commit

Analyzer - Fix bugs in data diagnosis (#355) · 54da021b

user4543 authored Jun 01, 2022

**Description**
Fix bugs in data diagnosis.

**Major Revision**
- add support for getting the baseline of a metric that uses custom benchmark naming with ':', such as 'nccl-bw:default/allreduce_8_bw:0'
- save raw data of all metrics, rather than only the metrics defined in diagnosis_rules.yaml, when output_all is True
- fix a bug that used the wrong column index when applying formatting (red color and percentile) in the excel output
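
To illustrate the naming in the first bullet, here is a minimal sketch of how such a metric name could be split. The pattern `<benchmark>[:<annotation>]/<metric>[:<rank>]`, the regex, and both helper names are assumptions made for illustration; they are not the analyzer's actual implementation.

```python
import re

# Assumed layout: <benchmark>[:<annotation>]/<metric>[:<rank>]
METRIC_PATTERN = re.compile(
    r'^(?P<benchmark>[^:/]+)'      # e.g. nccl-bw
    r'(?::(?P<annotation>[^/]+))?'  # optional custom annotation, e.g. default
    r'/(?P<metric>[^:]+)'          # e.g. allreduce_8_bw
    r'(?::(?P<rank>\d+))?$'        # optional rank suffix, e.g. 0
)


def split_metric_name(name):
    """Split a metric name into its parts (hypothetical helper)."""
    match = METRIC_PATTERN.match(name)
    if match is None:
        raise ValueError(f'unrecognized metric name: {name}')
    return match.groupdict()


def baseline_key(name):
    """Drop the rank suffix so all ranks look up the same baseline entry."""
    parts = split_metric_name(name)
    prefix = parts['benchmark']
    if parts['annotation'] is not None:
        prefix = f"{prefix}:{parts['annotation']}"
    return f"{prefix}/{parts['metric']}"


print(split_metric_name('nccl-bw:default/allreduce_8_bw:0'))
print(baseline_key('nccl-bw:default/allreduce_8_bw:0'))
```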

54da021b

29 Apr, 2022 1 commit

Release - SuperBench v0.5.0 (#350) · 6681c720

Yifan Xiong authored Apr 29, 2022



**Description**

Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)

Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

6681c720

11 Apr, 2022 2 commits

Benchmarks: Add Benchmark - Add FAMBench based on docker benchmark (#338) · 80dcc8aa

guoshzhao authored Apr 11, 2022

**Description**
Integrate FAMBench into superbench based on docker implementation:
https://github.com/facebookresearch/FAMBench

The script to run all benchmarks is:
https://github.com/facebookresearch/FAMBench/blob/main/benchmarks/run_all.sh

80dcc8aa

CLI - Integrate output all nodes diagnosis results (#339) · 8dc19ca4
user4543 authored Apr 11, 2022
```
**Description**
Integrate output all nodes diagnosis results.
```
8dc19ca4

10 Apr, 2022 1 commit
- Analyzer: Add Feature - Output results of all nodes in data diagnosis (#336) · 55b0f9d2
  user4543 authored Apr 10, 2022
```
**Description**
Output results of all nodes in data diagnosis.
```
  55b0f9d2
08 Apr, 2022 1 commit

CLI - Integrate result summary and update output format of data diagnosis (#335) · f15da60b

user4543 authored Apr 08, 2022

**Description**
Integrate result summary and update output format of data diagnosis.

**Major Revision**
- integrate result summary
- add md and html format for data diagnosis

f15da60b

01 Apr, 2022 1 commit

Benchmarks: Add Feature - Provide option to save raw data into file. (#333) · 6d895da8

guoshzhao authored Apr 01, 2022

**Description**
Use the `log_raw_data` config to control whether the raw data is logged to a file. The default value is `no`. It can be set to `yes` for particular benchmarks, such as the NCCL/RCCL tests, to save the raw data to a file.

6d895da8

24 Mar, 2022 1 commit

Analyzer: Add feature - Add result summary in excel,md,html format (#320) · 84fed1ce

user4543 authored Mar 24, 2022

**Description**
Add result summary in excel,md,html format.

**Major Revision**
- Add ResultSummary class to support result summary in excel,md,html format.
- Abstract RuleBase class for common-used functions in DataDiagnosis and ResultSummary.

84fed1ce

22 Mar, 2022 1 commit
- Bug: Benchmarks - remove fp16 samples type converting time (#332) · c5aa4f4e
  user4543 authored Mar 22, 2022
```
**Description**
Remove fp16 samples type converting time for training cnn and lstm inference.
```
  c5aa4f4e
21 Mar, 2022 1 commit

Config - Add inference config for NC A100 and NV A10 series (#329) · a9634ef5

Yifan Xiong authored Mar 21, 2022

Add inference config for preview SKUs, including:
* [NC96ads_A100_v4](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-a100-v4-series)
* [NV18ads_A10_v5](https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series)

a9634ef5

17 Mar, 2022 1 commit
- Bug: Benchmarks - remove fp16 samples type converting time for cnn and lstm models (#330) · 6e749180
  user4543 authored Mar 17, 2022
```
**Description**
Remove fp16 samples type converting time for cnn and lstm models.
```
  6e749180
16 Mar, 2022 1 commit

Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324) · ff51a3ce

rafsalas19 authored Mar 16, 2022

**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn

ff51a3ce

15 Mar, 2022 2 commits

Bug: Executor - fix bug in result writing to files for mpi mode (#328) · 84359fd8
user4543 authored Mar 16, 2022
```
**Description**
fix the bug in result writing to files for mpi mode.
```
84359fd8

Analyzer - Add md and html output format for DataDiagnosis (#325) · b3c95f18

user4543 authored Mar 15, 2022

**Description**
Add md and html output format for DataDiagnosis.

**Major Revision**
- add md and html support in file_handler
- add interface in DataDiagnosis for md and HTML output

**Minor Revision**
- move excel and json output interface into DataDiagnosis

b3c95f18

09 Mar, 2022 1 commit

Bug - Fix env path to absolute path (#327) · f755c0b6

Yifan Xiong authored Mar 09, 2022

Fix env file path to absolute path in `docker exec`, in case there are mixed ssh and local connections or different users are used.

f755c0b6

07 Mar, 2022 1 commit
- Analyzer: Revise - Abstract RuleBase from DataDiagnosis (#321) · 1ec055e1
  user4543 authored Mar 07, 2022
```
**Description**
Abstract RuleBase from DataDiagnosis.
```
  1ec055e1
06 Mar, 2022 1 commit

Benchmarks - Keep BatchNorm as fp32 for pytorch cnn models cast to fp16 (#322) · a9ef0f99

Jeff Daily authored Mar 06, 2022

**Description**
The BatchNorm operator is not numerically stable in fp16. PyTorch documentation recommends keeping the BN op in fp32 for fp16 AMP models. Refer to https://pytorch.org/docs/stable/amp.html#ops-that-can-autocast-to-float32. Preserving BN in fp32 for superbench more accurately reflects real workloads.

a9ef0f99

24 Feb, 2022 1 commit
- Bug Fix - Fix P2P detection in gpu_copy (#317) · 01304706
  Ziyue Yang authored Feb 25, 2022
```
**Description**
Fix invalid reference of P2P detection result in gpu_copy.
```
  01304706
22 Feb, 2022 1 commit

Bug - Fix empty HIP_ARCHITECTURES issue in cmake>=3.21.0 (#315) · e0c49142

user4543 authored Feb 22, 2022

**Description**
Fix the empty HIP_ARCHITECTURES issue with cmake>=3.21.0.
Refer to https://github.com/ROCm-Developer-Tools/HIP/pull/2364

e0c49142

20 Feb, 2022 2 commits

Config - Add T4 configurations for inference (#311) · ea2c10ab
Yifan Xiong authored Feb 20, 2022
```
Add T4 configurations for inference.
```
ea2c10ab

Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289) · 97ed12f9

user4543 authored Feb 20, 2022

**Description**
Add multi-rules feature for data diagnosis to support multiple rules' combined check.

**Major Revision**
- revise rule design to support multiple rules combination check
- update related codes and tests

97ed12f9

15 Feb, 2022 1 commit
- Bug - Fix env file path (#310) · 1f48268b
  Yifan Xiong authored Feb 15, 2022
```
Fix env file path for `docker run`.
```
  1f48268b
10 Feb, 2022 1 commit

Benchmarks: Revise Code - Add support for pytorch>=1.9.0 of init_process_group (#305) · e31b8c9e

user4543 authored Feb 10, 2022

**Description**
Add support for pytorch>=1.9.0 of init_process_group.

**Major Revision**
- Use PrefixStore(TCPStore) to init_process_group manually for each model run

e31b8c9e

09 Feb, 2022 1 commit

Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302) · 6cdf7595

Ziyue Yang authored Feb 09, 2022

**Description**
This commit removes NUMA binding for device-to-device tests because NUMA doesn't affect performance, and revises benchmark metrics accordingly.

6cdf7595

08 Feb, 2022 2 commits
- Benchmarks: Add Feature - Add GDR-only nccl-tests for Nvidia machines (#299) · 433785fd
  Ziyue Yang authored Feb 08, 2022
```
This commit adds GDR-only nccl-tests for Nvidia machines. Also bump NCCL to v2.10.3-1 to achieve peak performance in this test.
```
  433785fd
- Benchmarks: Revise Code - Make data checking in gpu_copy optional (#301) · 682b2c12
  Ziyue Yang authored Feb 08, 2022
```
This commit makes data checking in gpu_copy optional, because it takes too long when the message size is large.
```
  682b2c12
07 Feb, 2022 1 commit

Benchmarks: Revise Code - Reduce result variance in gpu_copy benchmark (#298) · 85389055

Ziyue Yang authored Feb 07, 2022

**Description**
This commit does the following to reduce result variance in the gpu_copy benchmark:
1) Add a warmup phase to the gpu_copy benchmark to avoid timing instability caused by first-time CUDA kernel launch overhead;
2) Use CUDA events for timing instead of CPU timestamps;
3) Make data checking optional; enabling it is not recommended in performance tests;
4) Enlarge the message size in the performance benchmark.

85389055

30 Jan, 2022 1 commit
- Bug - Fix typo in document (#297) · 28195be6
  Yuting Jiang authored Jan 30, 2022
```
Fix typo in document.
```
  28195be6
29 Jan, 2022 3 commits
- Benchmarks - Support T4 and A10 in GEMM benchmark (#294) · 3419447c
  Yifan Xiong authored Jan 29, 2022
```
Support T4 and A10 in GEMM benchmark.
```
  3419447c
- Config - Support customized env for all modes (#295) · 3524975c
  Yifan Xiong authored Jan 29, 2022
```
Support customized env for all modes in configuration.
```
  3524975c
- Benchmarks: Fix Bug - Fix GPU scan logic in gpu_copy (#296) · f3d05006
  Ziyue Yang authored Jan 29, 2022
```
Fix bug of GPU scan logic in bidirectional tests.
```
  f3d05006
28 Jan, 2022 2 commits

Benchmarks: Add Feature - Sync the E2E training results among all workers for each step. (#287) · d03d110f

guoshzhao authored Jan 28, 2022

**Description**
Please write a brief description and link the related issue if have.

**Major Revision**
- Sync (allreduce with max) the E2E training results among all workers.
- Avoid using ':0' in the metric name if only one rank has output.
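
The two revisions above can be pictured with a small pure-Python simulation. This is only a sketch: a real run would use a collective such as an allreduce with a max operation across processes, and both helper names and the sample metric name are hypothetical.

```python
def allreduce_max(per_rank_results):
    """Simulate an allreduce(max): element-wise max over every rank's step results.

    per_rank_results holds one list of per-step values for each rank.
    After the reduction, every rank would hold the same returned list.
    """
    return [max(step_values) for step_values in zip(*per_rank_results)]


def metric_name(base, rank, num_output_ranks):
    """Append ':<rank>' only when more than one rank reports output."""
    if num_output_ranks == 1:
        return base
    return f'{base}:{rank}'


# Two ranks, two steps: the slowest time per step is kept.
print(allreduce_max([[1.0, 2.5], [1.2, 2.0]]))  # [1.2, 2.5]
print(metric_name('step_time', 0, 1))  # step_time
print(metric_name('step_time', 1, 2))  # step_time:1
```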

d03d110f

Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288) · d877ca23

guoshzhao authored Jan 28, 2022

**Description**
Add timeout feature for each benchmark.

**Major Revision**
- Add a `timeout` config for each benchmark. In the current config files, the timeout is set only for kernel-launch as an example; other benchmarks can be set in the future.
- Set the timeout config for `ansible_runner.run()`. The runner will get return code 254:
   [ansible.py:80][WARNING] Run failed, return code 254.
- Use the `timeout` command to terminate the client process.
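
One way to picture the last bullet is prefixing the benchmark command with the coreutils `timeout` command. This is a minimal sketch, not SuperBench code: the helper name `wrap_with_timeout` and the example command are hypothetical.

```python
def wrap_with_timeout(command, timeout_seconds=None):
    """Prefix a command string with `timeout <seconds>` when a timeout is configured.

    GNU `timeout` kills the command after the given duration and then
    exits with status 124.
    """
    if timeout_seconds is None:
        # No timeout configured: run the command unchanged.
        return command
    return f'timeout {int(timeout_seconds)} {command}'


# './my_benchmark --iters 10' is a made-up command used only for illustration.
print(wrap_with_timeout('./my_benchmark --iters 10', 120))
print(wrap_with_timeout('./my_benchmark --iters 10'))
```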

d877ca23