Commits · d76e4e1c41b6116f36be590ac5df1576d3231d5a · tsoc / superbenchmark

20 Jan, 2023 2 commits

Docs - Upgrade version and release note (#467) · d76e4e1c

Yifan Xiong authored Jan 20, 2023

__Description__

Upgrade version and release note.

__Major Revision__

- Upgrade package versions
- Add release note for v0.7.0

d76e4e1c

Support `sb deploy` without pulling image (#466) · 5a93ed49
Yifan Xiong authored Jan 20, 2023
```
Support `--no-image-pull` in `sb deploy` to skip image pull to use local
cached images.
```
5a93ed49

18 Jan, 2023 1 commit

Doc - Update benchmark doc (#465) · c387d9c0

Yuting Jiang authored Jan 18, 2023

**Description**
Update benchmark doc of cublaslt, cublas-function and model benchmarks

c387d9c0

04 Jan, 2023 2 commits

Benchmarks - Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark (#454) · ccccd988
Yang Wang authored Jan 04, 2023
```
Support traffic patterns under the different devices in NCCL/RCCL test
* change the metrics format if specified the pattern
```
ccccd988

Runner - Generate host groups file in mpi mode (#458) · 8e748d56

Yang Wang authored Jan 04, 2023

**Major Revision**

- Add an option for pattern to generate mpi_pattern.txt file if
specified the path.
- In mpi pattern, serial_index and parallel_index will add in each
benchmark as environment variables.

**Minor Revision**
- Fix typo

8e748d56

03 Jan, 2023 2 commits
- Runner: Support `topo-aware` and `k-batch` pattern in 'mpi' mode (#437) · 65e433c0
  Yang Wang authored Jan 03, 2023
```
**Description**
Support the following patterns  in `mpi` mode:
* `k-batch`
* `topo-aware`
```
  65e433c0
- Benchmarks - Integrate cublaslt micro-benchmark (#455) · 616e7a5a
  Yifan Xiong authored Jan 03, 2023
```
Integrate cublaslt-gemm micro-benchmark #451.
```
  616e7a5a
30 Dec, 2022 1 commit

Executor - Add stdout logging util module and enable real-time logging flushing in executor (#445) · 9dfefce3

Yuting Jiang authored Dec 30, 2022

**Description**
Add stdout logging util module and enable real-time logging flushing in executor

**Major Revision**
- Add stdout logging util module to redirect stdout into file log
- enable stdout logging in executor to write benchmark output into both stdout and file `sb-bench.log`
- enable real-time log flushing in run_command of microbenchmarks through config `log_flushing`

**Minor Revision**
- add log_n_step args to enable regular step time log in model benchmarks 
- udpate related docs

9dfefce3

29 Dec, 2022 1 commit
- Runner - Support `pair-wise` pattern in `mpi` mode (#447) · 7838b6b1
  Yang Wang authored Dec 29, 2022
```
* Extract pair-wise pattern from ib_validation
```
  7838b6b1
29 Nov, 2022 1 commit

Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430) · e4eeda0a

Yang Wang authored Nov 29, 2022

* add mpi-parallels mode

* update according to comments

* fix and update doc

* update

* merge into 'mpi' mode

* udpate according to comments

* fix testcases

* fix ansible

* regard pattern as field

* udpate

* fix flake8 version

* add flake8 range

* remove map-by from host config

* udpate comments

e4eeda0a

06 Sep, 2022 1 commit

Release - SuperBench v0.6.0 (#409) · 63e9b2d1

Yifan Xiong authored Sep 06, 2022



**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

63e9b2d1

22 Aug, 2022 1 commit

Analyzer - Add support for both jsonl and json format in data diagnosis (#388) · 10a79c4e

Yuting Jiang authored Aug 22, 2022

**Description**
Add support for both jsonl and json format in data diagnosis.

**Major Revision**
- Add support for both jsonl and json format in data diagnosis


**Minor Revision**
- change related doc
- add jsonl support in cli

10a79c4e

17 Aug, 2022 1 commit

Update Python setup for require packages (#387) · 626ac0a4

Yifan Xiong authored Aug 17, 2022

__Description__

Update Python setup for require packages.

__Major Revisions__
* downgrade requests version to be compatible with python 3.6, add corresponding pipeline for 3.6
* add extra entry in extras_require for nested packages
* update `pip install` contents accordingly

626ac0a4

26 Jul, 2022 1 commit

Support topo-aware IB performance validation (#373) · ef4d6574

Jie Zhang authored Jul 26, 2022



* Support topo-aware IB performance validation

Add a new pattern `topo-aware`, so the user can run IB performance
test based on VM's topology information. This way, the user can
validate the IB performance across VM pairs with different distance
as a quick test instead of pair-wise test.

To run with topo-aware pattern, user needs to specify three required
(and two optional) parameters in YAML config file:
--pattern	topo-aware
--ibstat	path to ibstat output
--ibnetdiscover	path to ibnetdiscover output
--min_dist	minimum distance of VM pairs (optional, default 2)
--max_dist	maximum distance of VM pairs (optional, default 6)

The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs with
the specified distance (# hops).

The specified IB test will then be running across these
generated VM pairs.
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add description about topology aware ib traffic tests
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add unit test to verify generated topology aware config file

This commit adds unit test to verify the generated topology aware
config file is correct. To do so, four new data files are added in
order to invoke gen_topo_aware_config function to generate topology
aware config file, then compares it with the expected config file.
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Fix lint issue on Azure pipeline
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

ef4d6574

08 Jul, 2022 1 commit

Support node_num=1 in mpi mode (#372) · e00a8180

Yifan Xiong authored Jul 08, 2022

Support `node_num: 1` in mpi mode, so that we can run mpi benchmarks in
both 1 node and all nodes in one config by changing `node_num`.
Update docs and add test case accordingly.

e00a8180

29 Jun, 2022 1 commit

Fix issues in ib loopback benchmark (#369) · 620192a2

Yifan Xiong authored Jun 30, 2022

Fix several issues in ib loopback benchmark:
* use `--report_gbits` and divide by 8 to get GB/s, previous results are
  MiB/s / 1000
* use the ib_write_bw binary built in third_party instead of system path
* update the metrics name so that different hca indices have same metric

620192a2

24 Jun, 2022 1 commit

Support multiple IB/GPU in ib validation (#363) · bfaa1c83

Yifan Xiong authored Jun 24, 2022

**Description**

Support multiple IB/GPU devices run simultaneously in ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node could be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.

bfaa1c83

14 Jun, 2022 1 commit

Support `sb run` on host directly without Docker (#358) · a4937e95

Yifan Xiong authored Jun 14, 2022

**Description**

Support `sb run` on host directly without Docker

**Major Revisions**
- Add `--no-docker` argument for `sb run`.
- Run on host directly if `--no-docker` if specified.
- Update docs and tests correspondingly.

a4937e95

29 Apr, 2022 1 commit

Release - SuperBench v0.5.0 (#350) · 6681c720

Yifan Xiong authored Apr 29, 2022



**Description**

Cherry-pick  bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

6681c720

20 Apr, 2022 1 commit
- Docs - Update links using relative file paths with extensions (#346) · 712eafc3
  user4543 authored Apr 21, 2022
```
**Description**
Update links of referencing other docs using relative file paths with extensions.
```
  712eafc3
15 Apr, 2022 1 commit
- Docs - Update link to cli.md (#341) · cb266911
  Jared Bowden authored Apr 15, 2022
```
**Description**
Fixes relative link in documentation: point to `../cli.md`.
```
  cb266911
11 Apr, 2022 1 commit
- CLI - Integrate output all nodes diagnosis results (#339) · 8dc19ca4
  user4543 authored Apr 11, 2022
```
**Description**
Integrate output all nodes diagnosis results.
```
  8dc19ca4
08 Apr, 2022 2 commits

Docs - Add usage for result summary (#337) · 56c9a711
user4543 authored Apr 09, 2022
```
**Description**
Add usage for result summary.
```
56c9a711

CLI - Integrage result summary and update output format of data diagnosis (#335) · f15da60b

user4543 authored Apr 08, 2022

**Description**
Integrage result summary and update output format of data diagnosis.

**Major Revision**
- integrage result summary 
- add md and html format for data diagnosis

f15da60b

01 Apr, 2022 1 commit

Benchmarks: Add Feature - Provide option to save raw data into file. (#333) · 6d895da8

guoshzhao authored Apr 01, 2022

**Description**
Use config `log_raw_data` to control whether log the raw data into file or not. The default value is `no`. We can set it as `yes` for some particular benchmarks to save the raw data into file, such as NCCL/RCCL test.

6d895da8

16 Mar, 2022 1 commit

Benchmarks: Add Feature - Add GPU-Burn as microbenchmark (#324) · ff51a3ce

rafsalas19 authored Mar 16, 2022

**Description**
Modifications adding GPU-Burn to SuperBench.
- added third party submodule
- modified Makefile to make gpu-burn binary
- added/modified microbenchmarks to add gpu-burn python scripts
- modified default and azure_ndv4 configs to add gpu-burn

ff51a3ce

20 Feb, 2022 1 commit

Analyzer: Add Feature - Add multi-rules feature for data diagnosis (#289) · 97ed12f9

user4543 authored Feb 20, 2022

**Description**
Add multi-rules feature for data diagnosis to support multiple rules' combined check.

**Major Revision**
- revise rule design to support multiple rules combination check
- update related codes and tests

97ed12f9

09 Feb, 2022 1 commit

Benchmarks: Revise Code - Eliminate NUMA binding for device-to-device tests in gpu_copy (#302) · 6cdf7595

Ziyue Yang authored Feb 09, 2022

**Description**
This commit remove NUMA binding for device-to-device tests because NUMA doesn't affect performance, and revise benchmark metrics accordingly.

6cdf7595

30 Jan, 2022 1 commit
- Bug - Fix typo in document (#297) · 28195be6
  Yuting Jiang authored Jan 30, 2022
```
Fix typo in document.
```
  28195be6
29 Jan, 2022 1 commit
- Config - Support customized env for all modes (#295) · 3524975c
  Yifan Xiong authored Jan 29, 2022
```
Support customized env for all modes in configuration.
```
  3524975c
28 Jan, 2022 1 commit

Benchmarks: Add Feature - Add timeout feature for each benchmark. (#288) · d877ca23

guoshzhao authored Jan 28, 2022

**Description**
Add timeout feature for each benchmark.

**Major Revision**
- Add `timeout` config for each benchmark. In current config files, only set the timeout for kernel-launch as example. Other benchmarks can be set in the future.
- Set the timeout config for `ansible_runner.run()`. Runner will get the return code 254:
   [ansible.py:80][WARNING] Run failed, return code 254.
- Using `timeout` command to terminate the client process.

d877ca23

25 Jan, 2022 1 commit

Config - Update benchmark naming to support annotations (#284) · 7d7cd3dc

Yifan Xiong authored Jan 25, 2022

__Description__

Update benchmark naming to support annotations.

__Major Revisions__
- Update name for `create_benchmark_context` in executor.
- Backward compatibility for model benchmarks using "_models" suffix.
- Update documents.

7d7cd3dc

21 Jan, 2022 1 commit

Benchmarks: Add Feature - Add bidirectional test support in gpu_copy benchmark (#285) · 74421ffe

Ziyue Yang authored Jan 21, 2022

**Description**
This commit adds bidirectional tests in gpu_copy benchmark for both device-host transfer and device-device transfer, and revises related tests.

74421ffe

19 Jan, 2022 1 commit
- Benchmarks: Add Feature - Add percentile metrics for ort and pytorch inference benchmarks (#283) · fd2bc9e0
  guoshzhao authored Jan 19, 2022
```
**Description**
Add 50th, 90th, 95th, 99th, 99.9th latency metrics for ORT and pytorch inference benchmarks.
```
  fd2bc9e0
18 Jan, 2022 1 commit

CLI - Add command sb benchmark [list,list-parameters] (#279) · f7ffc545

Yifan Xiong authored Jan 18, 2022

__Description__

Add command `sb benchmark list` and `sb benchmark list-parameters` to support listing all optional parameters for benchmarks.

<details>
<summary>Examples</summary>
<pre>
$ sb benchmark list -n [a-z]+-bw -o table
Result
--------
mem-bw
nccl-bw
rccl-bw
</pre>
<pre>
$ sb benchmark list-parameters -n mem-bw
=== mem-bw ===
optional arguments:
  --bin_dir str         Specify the directory of the benchmark binary.
  --duration int        The elapsed time of benchmark in seconds.
  --mem_type str [str ...]
                        Memory types to benchmark. E.g. htod dtoh dtod.
  --memory str          Memory argument for bandwidthtest. E.g. pinned unpinned.
  --run_count int       The run count of benchmark.
  --shmoo_mode          Enable shmoo mode for bandwidthtest.
default values:
{'bin_dir': None,
 'duration': 0,
 'mem_type': ['htod', 'dtoh'],
 'memory': 'pinned',
 'run_count': 1}
</pre>
</details>

__Major Revisions__
* Add `sb benchmark list` to list benchmarks matching given name.
* Add `sb benchmark list-parameters` to list parameters for benchmarks which match given name.

__Minor Revisions__
* Sort format help text for argparse.

f7ffc545

30 Dec, 2021 1 commit

Release - SuperBench v0.4.0 (#278) · ff563b66

Yifan Xiong authored Dec 30, 2021



__Description__

Cherry-pick  bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

ff563b66

14 Dec, 2021 1 commit
- Docs - Add usage for data diagnosis (#266) · 682ed06a
  Yuting Jiang authored Dec 14, 2021
```
**Description**
Add usage for data diagnosis.
```
  682ed06a
13 Dec, 2021 3 commits
- Docs - Update docs for monitor. (#265) · 2e10fb0d
  guoshzhao authored Dec 13, 2021
```
**Description**
Update docs for monitor.
```
  2e10fb0d
- Benchmarks - Add transformers for TensorRT inference (#254) · cb8a3cfb
  Yifan Xiong authored Dec 13, 2021
```
Add transformers for TensorRT inference.
```
  cb8a3cfb
- Docs - Add benchmark metrics for cpu-memory-bw-latency (#264) · 10012a0a
  Ziyue Yang authored Dec 13, 2021
```
**Description**
Add benchmark metrics for cpu-memory-bw-latency.
```
  10012a0a