Commits · 2101e933ccef2fd4113c8634d40129eab52a153a · tsoc / superbenchmark

28 Jul, 2024 1 commit
- CI/CD - Fix MSCCL build error in CUDA12.4 docker build pipeline (#633) · 2101e933
  Yuting Jiang authored Jul 29, 2024
```
**Description**
Fix MSCCL build error in CUDA12.4 docker build pipeline due to OOM
issue.
```
  2101e933
22 Apr, 2024 1 commit

Dockerfile - Add CUDA 12.4 dockerfile (#619) · 7435f10a

Yuting Jiang authored Apr 22, 2024

**Description**
Add CUDA 12.4 dockerfile.

**Major Revision**
- upgrade nvidia docker to 23.04


**Minor Revision**
- upgrade hpcx to 2.18

7435f10a

18 Apr, 2024 1 commit
- Dockerfile - Upgrade mlc to v3.11 (#620) · dc3846cb
  Yuting Jiang authored Apr 18, 2024
```
**Description**
Upgrade mlc to v3.11.
```
  dc3846cb
21 Mar, 2024 1 commit

Bug Fix - Bug fix for cuda 12.2 dockerfile LD_LIBRARY_PATH issue (#614) · eeaa9b1a

Yang Wang authored Mar 22, 2024

**Description**
The CUDA 12.2 image reports an undefined symbol error due to an incomplete
LD_LIBRARY_PATH:


![image](https://github.com/microsoft/superbenchmark/assets/25875482/1a7c48c7-cb6b-4e3a-abbe-dde23007a96b)

### How to reproduce:
1. Deploy sb with cuda12.2 image
```
sb deploy -f local.ini -i superbench/superbench:v0.10.0-cuda12.2
```
2. Enter the container
```
sudo docker exec -it sb-workspace bash
```
3. Execute `mpirun`:
```
root@sb-container:~# mpirun
mpirun: symbol lookup error: mpirun: undefined symbol: opal_libevent2022_event_base_loop
```
### How to fix
* Append `hpcx_load` to /etc/bash.bashrc so that LD_LIBRARY_PATH is updated each time a shell starts

---------

eeaa9b1a

08 Jan, 2024 1 commit

Release - SuperBench v0.10.0 (#607) · 2c88db90

Yifan Xiong authored Jan 07, 2024



**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
Co-authored-by: guoshzhao <guzhao@microsoft.com>

2c88db90

09 Dec, 2023 1 commit

Dockerfile - Upgrade to rocm5.7 dockerfile (#587) · 1f5031bd

Yuting Jiang authored Dec 10, 2023



**Description**
upgrade to rocm5.7 dockerfile.

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>

1f5031bd

07 Dec, 2023 2 commits
- Benchmarks: Add MSCCL Support for Nvidia GPU (#584) · 6ef3a011
  Ziyue Yang authored Dec 07, 2023
```
**Description**
Add MSCCL support for Nvidia GPU
```
  6ef3a011
- Benchmarks: Add benchmark: Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark (#582) · dd5a6329
  Yuting Jiang authored Dec 07, 2023
```
**Description**
Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark
```
  dd5a6329
22 Nov, 2023 3 commits

Dockerfile - Upgrade Docker image to CUDA 12.2 (#577) · 1ad1c21c

Yifan Xiong authored Nov 22, 2023

Upgrade Docker image to CUDA 12.2 for H100:
* upgrade base image to 23.10
* fix onnxruntime version in python3.10
* fix compilation errors

1ad1c21c

Benchmarks: Micro benchmark - Add hipBLASLt function benchmark (#576) · 79089b65
Yuting Jiang authored Nov 22, 2023
```
**Description**
hipblaslt function benchmark and rebase cublaslt function benchmark.
```
79089b65

Analyzer - Generate baseline given results from multiple nodes. (#575) · 9f4880cb

guoshzhao authored Nov 22, 2023



**Description**
Generate baseline given results from multiple nodes. 

**Major Revision**
- Add sub command `sb result generate-baseline`
- Add UT and docs

---------
Co-authored-by: 454314380 <454314380@qq.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

9f4880cb

23 Oct, 2023 1 commit

Dockerfile - update mlc version into 3.10 for cuda and rocm dockerfiles (#562) · d246bab4

Yuting Jiang authored Oct 23, 2023



**Description**
Update the mlc version to 3.10 for the cuda and rocm dockerfiles to be
consistent with the cuda12 dockerfile.
Co-authored-by: yukirora <yuting.jiang@microsoft.com>

d246bab4

22 Aug, 2023 1 commit
- Benchmarks: micro benchmark - source code for evaluating NVDEC decoding performance (#560) · 27a10811
  Yuting Jiang authored Aug 22, 2023
```
**Description**
source code for evaluating NVDEC decoding performance.

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>
```
  27a10811
06 Jul, 2023 1 commit
- Benchmarks: micro benchmarks - add python code for DirectXGPUEncodingLatency (#548) · e8ac0b1e
  Yuting Jiang authored Jul 06, 2023
```
**Description**
add python code for DirectXGPUEncodingLatency.
```
  e8ac0b1e
03 Jul, 2023 1 commit
- Benchmarks: Build Pipeline - add AMF in third party and build AMF encoding latency test (#543) · 86547217
  Yuting Jiang authored Jul 03, 2023
```
**Description**
add AMF in third party and build AMF encoding latency test.
```
  86547217
28 Jun, 2023 1 commit

Dockerfile - Add SuperBench Windows Dockerfile (#534) · 44ef5314

Yuting Jiang authored Jun 28, 2023



**Description**
Add dockerfile for win10 and building script for directx_benchmarks.

**Major Revision**
- Add docker file for win10 and required scripts to install the
dependency
- Add building script to build all directx vs benchmarks
- Add call of building script in Makefile

---------
Co-authored-by: yukirora <yuting.jiang@microsoft.com>
Co-authored-by: Yifan Xiong <yifan.xiong@microsoft.com>

44ef5314

14 Apr, 2023 1 commit

Release - SuperBench v0.8.0 (#517) · 51761b3a

Yifan Xiong authored Apr 14, 2023



**Description**

Cherry-pick bug fixes from v0.8.0 to main.

**Major Revisions**

* Monitor - Fix the cgroup version checking logic (#502)
* Benchmark - Fix matrix size overflow issue in cuBLASLt GEMM (#503)
* Fix wrong torch usage in communication wrapper for Distributed
Inference Benchmark (#505)
* Analyzer: Fix bug in python3.8 due to pandas api change (#504)
* Bug - Fix bug to get metric from cmd when error happens (#506)
* Monitor - Collect realtime GPU power when benchmarking (#507)
* Add num_workers argument in model benchmark (#511)
* Remove unreachable condition when write host list (#512)
* Update cuda11.8 image to cuda12.1 based on nvcr23.03 (#513)
* Doc - Fix wrong unit of cpu-memory-bw-latency in doc (#515)
* Docs - Upgrade version and release note (#508)
Co-authored-by: guoshzhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

51761b3a

21 Mar, 2023 1 commit

Adding HPL benchmark (#482) · 655bd0aa

rafsalas19 authored Mar 21, 2023



**Description**

- Adding HPL benchmark

---------
Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>

655bd0aa

06 Mar, 2023 1 commit

Pin setuptools version to v65.7.0 (#483) · 35f53905

Yifan Xiong authored Mar 06, 2023

Pin setuptools version to
[v65.7.0](https://setuptools.pypa.io/en/latest/history.html#v65-7-0) to
avoid breaking changes since v66.0.0.
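
A minimal sketch of the version constraint described above, assuming a plain dotted numeric version string. The helper names `parse_numeric_version` and `is_before_breaking_release` are hypothetical and are not part of SuperBench or setuptools.

```python
def parse_numeric_version(text):
    """Turn '65.7.0' into (65, 7, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_before_breaking_release(version, first_breaking="66.0.0"):
    """Return True if `version` is older than the first breaking release."""
    return parse_numeric_version(version) < parse_numeric_version(first_breaking)


if __name__ == "__main__":
    # The pinned version from this commit passes; v66.0.0 and later do not.
    for candidate in ("65.7.0", "66.0.0", "67.1.0"):
        print(candidate, is_before_breaking_release(candidate))
```

The check compares integer tuples rather than raw strings, so that "9.0.0" is not treated as newer than "65.7.0".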

35f53905

13 Feb, 2023 1 commit

Adding Stream Benchmark (#473) · 32896ca4

rafsalas19 authored Feb 13, 2023



**Description**

- Added stream benchmark
- Added stream unit test
- Added stream example
- Modified docker files to build stream

---------
Co-authored-by: Ubuntu <azureuser@sbtestvm.jzlku1oskncengjiado35wf1hd.ax.internal.cloudapp.net>
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
Co-authored-by: Yifan Xiong <xiongyf@yandex.com>

32896ca4

07 Feb, 2023 1 commit

Dockerfile: Remove fixed rccl version in rocm5.1.x docker file (#476) · f21bfef2

pnunna93 authored Feb 07, 2023

**Description**
Commit e08b6d3a installs an rccl version that causes
"undefined symbol: ncclGetLastError" when importing torch.
Revert it to avoid the error.

f21bfef2

29 Dec, 2022 1 commit

Dockerfile - Add CUDA11.8 Docker image for Nvidia arch90 GPUs (#449) · a3c65b2a

Yifan Xiong authored Dec 29, 2022

Add Docker image for arch90 NVIDIA GPUs:

* add CUDA11.8 Dockerfile
* update archs in Makefile and benchmarks accordingly
* update image build pipeline

a3c65b2a

31 Oct, 2022 1 commit

CLI - Update version to include revision hash and date (#427) · d7bb8303

Yifan Xiong authored Oct 31, 2022

Update version to include revision hash and date in "{last tag}+g{git
hash}.d{date}" format, for example:
* exact tag: 0.6.0
* commit after tag: 0.6.0+gcbb1b34
* commit after tag with local changes: 0.6.0+gcbb1b34.d20221028
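
The format above can be sketched as follows. This is an illustration only, not the code from this commit: the function names `format_version` and `split_version` and the regular expression are assumptions, and the date is assumed to be eight digits as in the example.

```python
import re


def format_version(last_tag, git_hash=None, date=None):
    """Build "{last tag}+g{git hash}.d{date}", omitting the parts that do not apply."""
    version = last_tag
    if git_hash:
        version += "+g" + git_hash
        if date:
            # The date suffix only marks local changes on top of a commit.
            version += ".d" + date
    return version


# Hypothetical pattern for reading such a version string back.
VERSION_RE = re.compile(
    r"^(?P<tag>\d+\.\d+\.\d+)"
    r"(?:\+g(?P<hash>[0-9a-f]+)(?:\.d(?P<date>\d{8}))?)?$"
)


def split_version(version):
    """Split a version string into its tag, hash, and date parts."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError("unrecognized version: " + version)
    return match.groupdict()


if __name__ == "__main__":
    print(format_version("0.6.0"))
    print(format_version("0.6.0", "cbb1b34"))
    print(format_version("0.6.0", "cbb1b34", "20221028"))
    print(split_version("0.6.0+gcbb1b34.d20221028"))
```

Parts that are absent from the string come back as `None` from `split_version`, which makes it easy to tell an exact tag from a build with local changes.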

d7bb8303

06 Sep, 2022 1 commit

Release - SuperBench v0.6.0 (#409) · 63e9b2d1

Yifan Xiong authored Sep 06, 2022



**Description**

Cherry-pick bug fixes from v0.6.0 to main.

**Major Revisions**

* Enable latency test in ib traffic validation distributed benchmark (#396)
* Enhance parameter parsing to allow spaces in value (#397)
* Update apt packages in dockerfile (#398)
* Upgrade colorlog for NO_COLOR support (#404)
* Analyzer - Update error handling to support exit code of sb result diagnosis (#403)
* Analyzer - Make baseline file optional in data diagnosis and fix bugs (#399)
* Enhance timeout cleanup to avoid possible hanging (#405)
* Auto generate ibstat file by pssh (#402)
* Analyzer - Format int type and unify empty value to N/A in diagnosis output file (#406)
* Docs - Upgrade version and release note (#407)
* Docs - Fix issues in document (#408)
Co-authored-by: Yang Wang <yangwang1@microsoft.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>

63e9b2d1

17 Aug, 2022 1 commit

Update Python setup for require packages (#387) · 626ac0a4

Yifan Xiong authored Aug 17, 2022

__Description__

Update Python setup for require packages.

__Major Revisions__
* downgrade requests version to be compatible with python 3.6, add corresponding pipeline for 3.6
* add extra entry in extras_require for nested packages
* update `pip install` contents accordingly

626ac0a4

13 Aug, 2022 1 commit

Auto generate ibstat file for topo aware traffic pattern (#381) · faeee0a7

Yang Wang authored Aug 13, 2022

An enhancement for topo-aware IB performance validation #373.
This PR auto-generates the required ibstat file `ib_traffic_topo_aware_ibstat.txt`, which is used as input to build a graph.

faeee0a7

13 Jul, 2022 1 commit

Add dependencies (#374) · 16b6385d

Yifan Xiong authored Jul 13, 2022

Add dependencies

* include ndv4-topo.xml in cuda docker images
* require requests version to avoid RequestsDependencyWarning

16b6385d

06 Jul, 2022 1 commit

Update dependencies and Dockerfile (#371) · 9f03d568

Yifan Xiong authored Jul 06, 2022

Update dependencies and Dockerfile:
* upgrade nccl-tests and rccl-tests to current latest version to match
  NCCL/RCCL versions
* unify image tag names on DockerHub
* remove verbose output in Dockerfile and minor fix some flags

9f03d568

24 Jun, 2022 2 commits

Fix incorrect ulimit config in Dockerfile (#364) · 325a7338

Yifan Xiong authored Jun 24, 2022

Fix incorrect ulimit nofile config in Dockerfile.

Instead of bash, sh is used by default, where `echo` does not accept any parameters, so `-e` is written literally into /etc/security/limits.conf.

325a7338

Support multiple IB/GPU in ib validation (#363) · bfaa1c83

Yifan Xiong authored Jun 24, 2022

**Description**

Support multiple IB/GPU devices run simultaneously in ib validation benchmark.

**Major Revisions**
- Revise ib_validation_performance.cc so that multiple processes per node could be used to launch multiple perftest commands simultaneously. For each node pair in the config, number of processes per node will run in parallel.
- Revise ib_validation_performance.py to correct file paths and adjust parameters to specify different NICs/GPUs/NUMA nodes.
- Fix env issues in Dockerfile for end-to-end test.
- Update ib-traffic configuration examples in config files.
- Update unit tests and docs accordingly.

Closes #326.

bfaa1c83

19 Jun, 2022 1 commit

Update ROCm Dockerfile (#361) · 483bf782

Yifan Xiong authored Jun 19, 2022

**Description**

Update ROCm Dockerfile.

**Major Revisions**
- Add dockerfile for ROCm 5.1.3
- Merge 5.1.x and 5.0.x dockerfile
- Remove 4.2 and 4.0 legacy
- Update build pipeline accordingly

483bf782

15 Jun, 2022 1 commit

Fix cmake and build issues (#360) · 60a3c743

Yifan Xiong authored Jun 15, 2022

**Description**

Fix cmake and build issues.

**Major Revision**

* Remove unnecessary boost build
* Remove user-agent for mlc
* Remove -j for third party to build each project in sequence
* Fix ansible collections installation path

60a3c743

31 May, 2022 1 commit
- Dockerfile - Add support to run sb command inside docker image (#356) · 3f135e46
  user4543 authored Jun 01, 2022
```
**Description**
Add support to run sb command inside docker image - install missing dependency.
```
  3f135e46
27 May, 2022 1 commit
- Dockerfile: Update rccl version and fix issue in rocm5.1.1 dockerfile (#354) · e08b6d3a
  user4543 authored May 27, 2022
```
**Description**
Update rccl version and fix issue in rocm5.1.1 dockerfile.
```
  e08b6d3a
25 May, 2022 1 commit
- Dockerfile - Add dockerfile for rocm5.1.1 (#353) · 81a4146b
  user4543 authored May 25, 2022
```
**Description**
Add dockerfile for rocm5.1.1.
```
  81a4146b
28 Feb, 2022 1 commit
- Dockerfile - Add dockerfile for rocm5.0.1 (#319) · 425b9ff8
  user4543 authored Feb 28, 2022
```
**Description**
Add dockerfile for rocm5.0.1.
```
  425b9ff8
25 Feb, 2022 1 commit
- Dockerfile - Add rocm5.0 dockerfile (#307) · a4950a70
  user4543 authored Feb 26, 2022
```
**Description**
Add rocm5.0 dockerfile.
```
  a4950a70
08 Feb, 2022 1 commit

Benchmarks: Add Feature - Add GDR-only nccl-tests for Nvidia machines (#299) · 433785fd

Ziyue Yang authored Feb 08, 2022

This commit adds GDR-only nccl-tests for Nvidia machines. Also bump NCCL to v2.10.3-1 to achieve peak performance in this test.

433785fd

30 Dec, 2021 1 commit

Release - SuperBench v0.4.0 (#278) · ff563b66

Yifan Xiong authored Dec 30, 2021



__Description__

Cherry-pick bug fixes from v0.4.0 to main.

__Major Revisions__

* Bug - Fix issues for Ansible and benchmarks (#267)
* Tests - Refine test cases for microbenchmark (#268)
* Bug - Build openmpi with ucx support in rocm dockerfiles (#269)
* Benchmarks: Fix Bug - Fix fio build issue (#272)
* Docs - Unify metric and add doc for cublas and cudnn functions (#271)
* Monitor: Revision - Add 'monitor/' prefix to monitor metrics in result summary (#274)
* Bug - Fix bug of detecting if gpu_index is none (#275)
* Bug - Fix bugs in data diagnosis (#273)
* Bug - Fix issue that the root mpi rank may not be the first in the hostfile (#270)
* Benchmarks: Configuration - Update inference and network benchmarks in configs (#276)
* Docs - Upgrade version and release note (#277)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>

ff563b66

13 Dec, 2021 1 commit

Benchmarks: Add Benchmark - Add mlc benchmark to superbench (#216) · b590409e

Hossein Pourreza authored Dec 12, 2021

**Description**
Add mlc memory bandwidth and latency micro benchmark to Superbench.

**Major Revision**
- Add mlc benchmark with test and example files

b590409e