Commits · 51761b3af172b4fc54ce0a3abc302e203d2bf44a · tsoc / superbenchmark

14 Apr, 2023 1 commit

Release - SuperBench v0.8.0 (#517) · 51761b3a

Yifan Xiong authored Apr 14, 2023



**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in python3.8 due to pandas api change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when write host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)

Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

51761b3a

22 Mar, 2023 1 commit

Monitor - Support cgroup V2 when read system metrics. (#491) · a9b45a07

guoshzhao authored Mar 22, 2023

**Description**
Ubuntu 22.04 uses cgroup v2, which changes the cgroup file structure.
Modify the monitor to support both cgroup v1 and v2.
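As a rough illustration of what supporting both versions can involve, the sketch below distinguishes the two layouts by checking for `cgroup.controllers`, a file the kernel exposes at the root of a cgroup v2 unified hierarchy. The function names and the choice of the memory usage file are assumptions made for this example; this is not the monitor's actual code.

```python
from pathlib import Path


def detect_cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` looks like a cgroup v2 unified hierarchy, else 1."""
    # cgroup v2 exposes `cgroup.controllers` at the root of the hierarchy.
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1


def memory_usage_file(root: str = "/sys/fs/cgroup") -> Path:
    """Pick the memory usage file that matches the detected cgroup version."""
    if detect_cgroup_version(root) == 2:
        # v2: controllers share one directory tree.
        return Path(root) / "memory.current"
    # v1: each controller has its own subdirectory.
    return Path(root) / "memory" / "memory.usage_in_bytes"
```

Passing `root` as a parameter keeps the check testable against a temporary directory instead of the live `/sys/fs/cgroup`.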

a9b45a07

13 Feb, 2023 1 commit

Executor - Support SuperBench Executor running on Windows (#475) · 62a29134

Yuting Jiang authored Feb 13, 2023

**Description**
Support SuperBench Executor running on Windows.

**Major Revision**
- Lazy import ansible related module

62a29134

04 Jan, 2023 1 commit

Runner - Generate host groups file in mpi mode (#458) · 8e748d56

Yang Wang authored Jan 04, 2023

**Major Revision**

- Add a pattern option that generates an mpi_pattern.txt file when a
path is specified.
- In mpi pattern, serial_index and parallel_index are added to each
benchmark as environment variables.

**Minor Revision**
- Fix typo

8e748d56

03 Jan, 2023 1 commit

Runner: Support `topo-aware` and `k-batch` pattern in 'mpi' mode (#437) · 65e433c0

Yang Wang authored Jan 03, 2023

**Description**
Support the following patterns in `mpi` mode:
* `k-batch`
* `topo-aware`

65e433c0

30 Dec, 2022 1 commit

Executor - Add stdout logging util module and enable real-time logging flushing in executor (#445) · 9dfefce3

Yuting Jiang authored Dec 30, 2022

**Description**
Add stdout logging util module and enable real-time logging flushing in executor

**Major Revision**
- Add stdout logging util module to redirect stdout into file log
- Enable stdout logging in executor to write benchmark output into both stdout and file `sb-bench.log`
- Enable real-time log flushing in run_command of microbenchmarks through config `log_flushing`

**Minor Revision**
- Add log_n_step args to enable regular step time log in model benchmarks
- Update related docs
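The idea of writing output to both stdout and a log file with immediate flushing can be sketched as a small "tee" object. `StdoutTee` and `run_with_tee` are hypothetical names used only for this illustration; the module added in this commit is not shown here and may work differently.

```python
import sys


class StdoutTee:
    """File-like object that writes to the original stream and a log file."""

    def __init__(self, stream, log_path):
        self._stream = stream
        self._log = open(log_path, "a", encoding="utf-8")

    def write(self, text):
        self._stream.write(text)
        self._log.write(text)
        # Flush on every write so the log can be followed in real time.
        self.flush()
        return len(text)

    def flush(self):
        self._stream.flush()
        self._log.flush()

    def close(self):
        self._log.close()


def run_with_tee(func, log_path):
    """Call func() with sys.stdout duplicated into log_path, then restore it."""
    original = sys.stdout
    tee = StdoutTee(original, log_path)
    sys.stdout = tee
    try:
        return func()
    finally:
        # Always restore stdout, even if func() raises.
        sys.stdout = original
        tee.close()
```

For example, `run_with_tee(lambda: print("step 1"), "sb-bench.log")` prints the line and appends it to the file. Output written by child processes bypasses `sys.stdout`, so a real executor would also need to forward subprocess output.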

9dfefce3

29 Dec, 2022 1 commit

Runner - Support `pair-wise` pattern in `mpi` mode (#447) · 7838b6b1

Yang Wang authored Dec 29, 2022

* Extract pair-wise pattern from ib_validation

7838b6b1

29 Nov, 2022 1 commit

Runner - support 'pattern' in 'mpi' mode to run tasks in parallel (#430) · e4eeda0a

Yang Wang authored Nov 29, 2022

* add mpi-parallels mode

* update according to comments

* fix and update doc

* update

* merge into 'mpi' mode

* update according to comments

* fix testcases

* fix ansible

* regard pattern as field

* update

* fix flake8 version

* add flake8 range

* remove map-by from host config

* update comments

e4eeda0a

06 Sep, 2022 1 commit

Release - SuperBench v0.6.0 (#409) · 63e9b2d1

Yifan Xiong authored Sep 06, 2022



**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)

Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

63e9b2d1

13 Aug, 2022 1 commit

Auto generate ibstat file for topo aware traffic pattern (#381) · faeee0a7

Yang Wang authored Aug 13, 2022

An enhancement for topo-aware IB performance validation #373.
This PR auto-generates the required ibstat file `ib_traffic_topo_aware_ibstat.txt`, which is used as input to build a graph.

faeee0a7

26 Jul, 2022 1 commit

Support topo-aware IB performance validation (#373) · ef4d6574

Jie Zhang authored Jul 26, 2022



* Support topo-aware IB performance validation

Add a new pattern `topo-aware`, so the user can run IB performance
test based on VM's topology information. This way, the user can
validate the IB performance across VM pairs with different distance
as a quick test instead of pair-wise test.

To run with the topo-aware pattern, the user needs to specify three
required (and two optional) parameters in the YAML config file:

- `--pattern`: `topo-aware`
- `--ibstat`: path to ibstat output
- `--ibnetdiscover`: path to ibnetdiscover output
- `--min_dist`: minimum distance of VM pairs (optional, default 2)
- `--max_dist`: maximum distance of VM pairs (optional, default 6)

The newly added topo_aware module then parses the topology
information, builds a graph, and generates the VM pairs with
the specified distance (# hops).

The specified IB test will then be running across these
generated VM pairs.
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add description about topology aware ib traffic tests
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Add unit test to verify generated topology aware config file

This commit adds unit test to verify the generated topology aware
config file is correct. To do so, four new data files are added in
order to invoke gen_topo_aware_config function to generate topology
aware config file, then compares it with the expected config file.
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>

* Fix lint issue on Azure pipeline
Signed-off-by: Jie Zhang <jessezhang1010@gmail.com>
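
The pair generation described in this commit (build a graph, then select VM pairs by hop distance) can be sketched with a breadth-first search. This is an illustration only: the graph, the node names, and the function names are hypothetical, and the real topo_aware module parses ibstat and ibnetdiscover output instead of a hand-written dictionary. Only the default bounds 2 and 6 come from the commit message.

```python
from collections import deque
from itertools import combinations


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def vm_pairs_by_distance(graph, vms, min_dist=2, max_dist=6):
    """Group VM pairs by hop distance, keeping min_dist <= distance <= max_dist."""
    dist_from = {vm: hop_distances(graph, vm) for vm in vms}
    pairs = {}
    for a, b in combinations(sorted(vms), 2):
        d = dist_from[a].get(b)  # None if b is unreachable from a
        if d is not None and min_dist <= d <= max_dist:
            pairs.setdefault(d, []).append((a, b))
    return pairs


# Hypothetical topology: two leaf switches under one spine switch.
graph = {
    "vm0": ["leaf0"], "vm1": ["leaf0"], "vm2": ["leaf1"],
    "leaf0": ["vm0", "vm1", "spine"],
    "leaf1": ["vm2", "spine"],
    "spine": ["leaf0", "leaf1"],
}
pairs = vm_pairs_by_distance(graph, ["vm0", "vm1", "vm2"])
# pairs == {2: [("vm0", "vm1")], 4: [("vm0", "vm2"), ("vm1", "vm2")]}
```

Grouping by distance means one representative pair per distance class can be tested, which is what makes this a quicker check than a full pair-wise run.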

ef4d6574

05 Jul, 2022 1 commit

CLI - Support SKU auto detect if running on Azure VM (#365) · a94ead34

Yifan Xiong authored Jul 05, 2022

Support SKU auto detect and using corresponding benchmark config if running on Azure VM.

a94ead34

24 Jan, 2022 1 commit

Bug: Fix code insecure issue that binds a socket to all network interfaces (#291) · 35fc06eb

Yuting Jiang authored Jan 24, 2022

**Description**
Fix code insecure issue that binds a socket to all network interfaces.

35fc06eb

15 Nov, 2021 1 commit

Benchmarks: Add Feature - Extend the device manager utility to support more functions. (#239) · cc70f9c1

guoshzhao authored Nov 15, 2021

**Description**
Rename `nvidia_helper` utility as `device_manager` module and support more functions:
```
device_manager.get_device_count()
device_manager.get_device_utilization(idx)
device_manager.get_device_temperature(idx)
device_manager.get_device_power_limit(idx)
device_manager.get_device_memory(idx)
device_manager.get_device_row_remapped_info(idx)
device_manager.get_device_ecc_error(idx)
```

cc70f9c1

31 Aug, 2021 1 commit

Benchmarks: Code Revision - Revise subprocess invoke (#178) · 8cd264fd

guoshzhao authored Aug 31, 2021

**Description**
Package frequently-used subprocess invoke into function.

8cd264fd

13 Jul, 2021 1 commit

Utils: Code Revision - Update network common utils (#118) · 71c1617b

Yuting Jiang authored Jul 13, 2021


Update network common utils: add `get_ib_devices` and move `get_free_port` from test utils to network common utils.
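
A helper like `get_free_port` is commonly written by binding to port 0 and letting the operating system choose. The sketch below shows that common technique under the assumption of a loopback-only bind. Only the function name comes from the commit message; the body is not taken from the repository.

```python
import socket


def get_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 lets the OS pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

The port is released when the socket closes, so another process could claim it before the caller does; this pattern suits tests better than production services.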

71c1617b

09 Jul, 2021 1 commit

Bug bash - Merge fix from release/0.2 to main (#124) · 9c984c7e

guoshzhao authored Jul 09, 2021



* Bug Fix - Fix race condition issue for multi ranks (#117)

Fix race condition issue when multiple ranks rotate the same directory.

* Update pipeline for release branch (#122)

* Bug Fix - Fix bug when converting bool config to store_true argument. (#120)
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>

9c984c7e

02 Jul, 2021 1 commit

Runner - Fetch benchmarks results on all nodes (#116) · fb7d4a73

Yifan Xiong authored Jul 02, 2021

Fetch benchmark results from all nodes; results are synced with rsync after each benchmark.
The results directory structure on the control node is as follows:

```
outputs/
└── datetime
    ├── nodes
    │   └── node-0
    │       ├── benchmarks
    │       │   ├── benchmark-0
    │       │   │   ├── rank-0
    │       │   │   │   └── results.json
    │       └── sb-exec.log
    ├── sb-run.log
    └── sb.config.yaml
```
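
Given the layout above, a script on the control node could gather every `results.json` by globbing the tree. The sketch below is a hypothetical helper, not part of SuperBench, and it treats the contents of `results.json` as opaque JSON because the schema is not described here.

```python
import json
from pathlib import Path


def collect_results(run_dir):
    """Map (node, benchmark, rank) to the parsed results.json under run_dir."""
    results = {}
    # Matches <run_dir>/nodes/<node>/benchmarks/<benchmark>/<rank>/results.json
    pattern = "nodes/*/benchmarks/*/*/results.json"
    for path in sorted(Path(run_dir).glob(pattern)):
        # parents: rank dir, benchmark dir, "benchmarks", node dir
        rank, benchmark, _, node = (p.name for p in list(path.parents)[:4])
        with path.open(encoding="utf-8") as f:
            results[(node, benchmark, rank)] = json.load(f)
    return results
```

Here `run_dir` would be the `outputs/<datetime>` directory of a single run.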

fb7d4a73

01 Jul, 2021 1 commit

CLI - Support custom output directory (#110) · 7b0b0e9a

Yifan Xiong authored Jul 01, 2021

* Support custom output directory.
* Update document.

7b0b0e9a

23 Jun, 2021 1 commit

Bug bash - Fix bugs in multi GPU benchmarks (#98) · c0c43b8f

Yifan Xiong authored Jun 23, 2021

* Add `sb deploy` command content.
* Fix inline if-expression syntax in playbook.
* Fix quote escape issue in bash command.
* Add custom env in config.
* Update default config for multi GPU benchmarks.
* Update MANIFEST.in to include jinja2 template.
* Require jinja2 minimum version.
* Fix occasional duplicate output in Ansible runner.
* Fix mixed color from Ansible and Python colorlog.
* Update according to comments.
* Change superbench.env from list to dict in config file.

c0c43b8f

16 Jun, 2021 1 commit

Bug bash - Fix bugs and refine log in single GPU benchmarks (#97) · ddbc51a1

Yifan Xiong authored Jun 16, 2021

Fix bugs and refine log in single GPU benchmarks:

* Fix none framework issue
* Fix empty parameter bug
* Remove missed mobilenet_v3 models
* Change benchmark registration log to debug level
* Add pid in logging
* Add missing benchmarks in default config
* Fix deprecated logging warn

ddbc51a1

01 Jun, 2021 1 commit

Benchmarks: Add Feature - Add nvml package to provide python interfaces of nvidia. (#91) · 331c740a

guoshzhao authored Jun 01, 2021

331c740a

18 May, 2021 1 commit

CLI - Refine CLI handlers (#68) · 977b1a73

Yifan Xiong authored May 18, 2021

* use absolute path of input file
* parse registry uri from image
* merge common parts for arguments processing

977b1a73

11 May, 2021 1 commit

Utils - Support lazy import (#67) · 57ce473a

Yifan Xiong authored May 11, 2021

__Major Revision__

* Support lazy import.
* Not importing benchmarks when running `help`, `version`, `deploy` commands, etc.
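
One simple way to defer an import until first use is a proxy object that loads the module on its first attribute access. The sketch below shows that general technique with a hypothetical `LazyModule` class; the utility added in this commit is not shown here and may be implemented differently.

```python
import importlib


class LazyModule:
    """Proxy that imports the named module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # __getattr__ runs only for attributes not found on the proxy itself,
        # so _name and _module never reach this method.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Creating the proxy is cheap; the import happens on the first real use.
lazy_json = LazyModule("json")
encoded = lazy_json.dumps({"cmd": "version"})
```

With this pattern, commands that never touch the heavy module never pay its import cost.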

57ce473a

29 Mar, 2021 1 commit

Update logger (#28) · 0e2b2b08

Yifan Xiong authored Mar 29, 2021

Update logger class.
* add file handler along with stream handler
* add colored formatter
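
Attaching a file handler next to a stream handler can be sketched with the standard `logging` module. The function name, format string, and log path are assumptions for this example, and the colored formatter is left out to keep the sketch free of third-party dependencies.

```python
import logging
import sys


def build_logger(name, log_path, level=logging.INFO):
    """Return a logger that writes each record to stdout and to log_path."""
    logger = logging.getLogger(name)
    if logger.handlers:
        # Already configured; avoid attaching duplicate handlers.
        return logger
    logger.setLevel(level)
    logger.propagate = False
    formatter = logging.Formatter("[%(asctime)s %(levelname)s] %(message)s")
    for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A caller would then use `build_logger("sb", "sb-run.log").info("started")` to emit the same record to both destinations.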

0e2b2b08

26 Mar, 2021 1 commit

CLI: Code Revision - Use omegaconf to replace hydra for configuration (#27) · 91b44bc5

Yifan Xiong authored Mar 26, 2021

Use omegaconf to replace hydra for configuration system:
* remove hydra
* use omegaconf to merge configurations

91b44bc5

12 Mar, 2021 1 commit

CLI - Add command sb [version,deploy,exec,run] (#10) · 5d11579a

Yifan Xiong authored Mar 12, 2021

- Add CLI commands
  * sb version
  * sb deploy
  * sb exec
  * sb run
- Add interface with executor and runner
- Add cli test cases

5d11579a

04 Mar, 2021 1 commit

fix typos (#14) · abc6c991

guoshzhao authored Mar 04, 2021

Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>

abc6c991

24 Feb, 2021 1 commit

Benchmarks: Initialization - Add base class, registry, and result (#1) · 4c87a3e4

guoshzhao authored Feb 24, 2021

* benchmarks init.

Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>

4c87a3e4