Commits · dfbd70b129c7420deff2a19de28c12c3ce2d431f · tsoc / superbenchmark

26 Sep, 2021 1 commit

Release - SuperBench v0.3.0 (#212) · dfbd70b1

Yifan Xiong authored Sep 26, 2021



**Description**

Cherry-pick bug fixes from v0.3.0 to main.

**Major Revisions**
* Docs - Upgrade version and release note (#209)
* Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210)
* Benchmarks: Update - Update benchmarks in configuration file (#208)
* CI/CD - Update GitHub Action VM (#211)
* Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203)
* CI/CD - Fix bug in build image for push event (#205)
* Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204)
* Tool: Fix bug - Fix function naming issue in system info (#200)
* CI/CD - Push images in GitHub Action (#202)
* Bug - Fix torch.distributed command for single node (#201)
* CLI - Integrate system info for node (#199)
* Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196)
* CI/CD - Add ROCm image build in GitHub Actions (#194)
* Bug: Fix bug - fix bug of hipBusBandwidth build (#193)
* Benchmarks: Build Pipeline - Restore rocblas build logic (#197)
* Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198)
* Bug - Revise 'docker run' in sb deploy (#195)
* Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190)

Co-authored-by: Yuting Jiang <v-yujiang@microsoft.com>
Co-authored-by: Guoshuai Zhao <guzhao@microsoft.com>
Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>

dfbd70b1

06 Sep, 2021 1 commit

Tools: Add Feature - Add script to generate system config info. (#160) · 37b15db9

Yuting Jiang authored Sep 06, 2021

**Description**
Add script to generate system config info.

**Major Revision**
- Add script to generate system config info into the dict in superbench/tools.

37b15db9

03 Sep, 2021 1 commit

Benchmarks: Code Revision - Revise arguments of nccl/rccl to support mpi mode... · 60762518

Yuting Jiang authored Sep 03, 2021

Benchmarks: Code Revision - Revise arguments of nccl/rccl to support mpi mode and rename metric (#189)

**Description**
Revise the nccl/rccl arguments to support mpi mode (mpi cannot run nccl/rccl when multiple operators run in sequence without a barrier) and rename the metric.

**Major Revision**
- revise argument operators to be a single one

**Minor Revision**
- rename metric to remove benchmark name info
- change argument ngpus default value to be 1

60762518

02 Sep, 2021 6 commits

Dockerfile - Fix ulimit nofile in Docker images (#183) · 4e431f11

Yifan Xiong authored Sep 02, 2021

__Description__

Resolve the "too many open files" issue when running NCCL/RCCL on multiple nodes using Docker images by increasing the nofile limit in limits.conf.

4e431f11
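To see whether a container is exposed to the "too many open files" issue addressed in #183, the open-file limit can be inspected from Python. This is a minimal sketch using the standard `resource` module (Unix only); the threshold `MIN_NOFILE` and both helper names are illustrative assumptions, since the commit does not state the value it sets.

```python
import resource

# Hypothetical threshold for illustration; not a value taken from the commit.
MIN_NOFILE = 65536


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_nofile_sufficient(minimum=MIN_NOFILE):
    """Check whether the soft limit allows at least `minimum` open files."""
    soft, _hard = nofile_limits()
    # RLIM_INFINITY means no limit is enforced.
    return soft == resource.RLIM_INFINITY or soft >= minimum


if __name__ == '__main__':
    soft, hard = nofile_limits()
    print(f'nofile soft={soft} hard={hard} sufficient={is_nofile_sufficient()}')
```

Running this inside the container before launching a multi-node NCCL/RCCL job shows whether the limits.conf change has taken effect for the current process.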

Benchmarks: Fix bug - Fix missing key error in disk performance benchmark (#188) · b79e2845
Ziyue Yang authored Sep 02, 2021
```
**Description**
This commit fixes the error caused by the missing key 'percentile' when parsing FIO results.
```
b79e2845
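The missing-key error fixed in #188 is the kind of failure a defensive lookup avoids. The sketch below assumes FIO's JSON output nests percentiles under `<operation>.clat_ns.percentile`; the sample data is made up, and `extract_percentiles` is a hypothetical helper, not the benchmark's actual parser.

```python
def extract_percentiles(job, op):
    """Return {percentile: latency_ns} for one operation, or {} if FIO omitted it."""
    clat = job.get(op, {}).get('clat_ns', {})
    # Using .get() instead of clat['percentile'] avoids a KeyError when the key is absent.
    return {float(k): v for k, v in clat.get('percentile', {}).items()}


# Made-up sample resembling FIO's JSON output; 'write' has no 'percentile' entry.
sample_job = {
    'jobname': 'example',
    'read': {'clat_ns': {'percentile': {'50.000000': 1200, '99.000000': 5400}}},
    'write': {'clat_ns': {'min': 0, 'max': 0}},
}

read_pct = extract_percentiles(sample_job, 'read')
write_pct = extract_percentiles(sample_job, 'write')
```

Here `write_pct` is simply empty, so the caller can skip the percentile metrics for that operation instead of failing.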

Benchmarks: Add Configuration - Add microbenchmark in the validation config... · 47daedbe

Yuting Jiang authored Sep 02, 2021

Benchmarks: Add Configuration - Add microbenchmark in the validation config file for HPE (AMD MI100) (#176)

**Description**
Add microbenchmark in the validation config file for AMD MI100.

**Major Revision**
- add rccl-bw, mem-bw, ib-loopback, gemm-flops, kernel-launch config for mi100

47daedbe

Docs - Support docsearch in website (#184) · 2ebb44cc
Yifan Xiong authored Sep 02, 2021
```
Support docsearch in website, powered by [Algolia](https://docsearch.algolia.com).
```
2ebb44cc

Runner - Fix inventory issue in ansible_runner (#185) · e2453e1c

Yifan Xiong authored Sep 02, 2021

__Description__

Fix inventory bug in ansible_runner when host list is provided with multiple hosts.

This ought to be handled by the ansible_runner library; as a workaround, pass the `--inventory` argument on the command line.

e2453e1c

Docs: Add system config info for result collection (#168) · ab71bbb4
TobeyQin authored Sep 02, 2021
```
**Description**
Add system config info for result collection
```
ab71bbb4

01 Sep, 2021 3 commits

Benchmarks: Code Revision - revise the DockerBenchmark base class (#179) · 37d5dfd5

guoshzhao authored Sep 01, 2021

**Description**
Revise the DockerBenchmark base class to support image pull, image removal, etc.

**Major Revision**
- image pull in _preprocess()
- image clean in _postprocess()
- execute customized commands in _benchmark()
- add unit tests

37d5dfd5
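The three phases listed in #179 can be sketched as follows. `DockerBenchmarkSketch` is a hypothetical stand-in written for illustration, not SuperBench's actual class; the use of `docker pull` and `docker rmi` and the injectable `runner` are assumptions.

```python
import subprocess


class DockerBenchmarkSketch:
    """Hypothetical stand-in for a Docker-based benchmark; not SuperBench's actual class."""

    def __init__(self, image, commands, runner=None):
        self.image = image          # image to pull before the run
        self.commands = commands    # customized shell commands to execute
        # The runner is injectable so the flow can be exercised without Docker.
        self.runner = runner or self._shell

    @staticmethod
    def _shell(command):
        """Run a shell command and return its exit code."""
        return subprocess.run(command, shell=True, check=False).returncode

    def _preprocess(self):
        """Pull the image before the benchmark starts."""
        return self.runner(f'docker pull {self.image}') == 0

    def _benchmark(self):
        """Execute each customized command, stopping at the first failure."""
        return all(self.runner(cmd) == 0 for cmd in self.commands)

    def _postprocess(self):
        """Remove the image once the benchmark has finished."""
        return self.runner(f'docker rmi {self.image}') == 0

    def run(self):
        """Run the phases in order; cleanup is attempted whenever the pull succeeded."""
        if not self._preprocess():
            return False
        try:
            return self._benchmark()
        finally:
            self._postprocess()
```

Passing a fake `runner` that records commands and returns `0` makes it possible to unit-test the phase ordering without a Docker daemon.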

Dockerfile: Add Package - Install openmpi for ROCm images (#181) · 115cd2e6
guoshzhao authored Sep 01, 2021
```
**Description**
Install openmpi-4.0.0 for ROCm images.
```
115cd2e6

Benchmarks: Docker Benchmarks - Setup Docker-in-Docker environment (#180) · 7d947757

guoshzhao authored Sep 01, 2021

**Description**
Set up a Docker environment inside the Docker container.

**Major Revision**
- Install docker client for cuda and rocm images.
- Mount /var/run/docker.sock from host

7d947757
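Mounting the host socket, as #180 describes, lets the docker client inside a container talk to the host daemon. Below is a small sketch of how such a `docker run` invocation might be assembled; the image name and the helper `docker_run_args` are hypothetical.

```python
DOCKER_SOCKET = '/var/run/docker.sock'


def docker_run_args(image, command, socket_path=DOCKER_SOCKET):
    """Build `docker run` arguments that share the host's Docker daemon with the container."""
    return [
        'docker', 'run', '--rm',
        # Bind-mount the host socket so the docker client inside reaches the host daemon.
        '-v', f'{socket_path}:{socket_path}',
        image,
    ] + list(command)


# Hypothetical image name; `docker ps` inside the container would list the host's containers.
args = docker_run_args('example/superbench:tag', ['docker', 'ps'])
```

Containers started this way are siblings of the outer container on the host, not nested children, which is worth keeping in mind when mounting paths.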

31 Aug, 2021 5 commits

Benchmarks: Build Pipeline - Support rocblas building in... · b90b47f3

Yuting Jiang authored Sep 01, 2021

Benchmarks: Build Pipeline - Support rocblas building in rocm4.0_ubuntu18.04_py3.6_pytorch_1.7.0 docker (#172)

**Description**
Revise rocblas building logic in third_party/makefile to support rocblas building in rocm4.0_ubuntu18.04_py3.6_pytorch_1.7.0 docker.

**Major Revision**
- add extra build logic, including an environment setting for the pthread limit and a restricted build command, to reduce resource usage

**Minor Revision**
- make rocm_version configurable

b90b47f3

Benchmarks: Code Revision - Revise metric name generation and default config... · 024a870b

Ziyue Yang authored Aug 31, 2021

Benchmarks: Code Revision - Revise metric name generation and default config for disk performance benchmark (#175)

**Description**
This commit revises disk performance benchmark, including:
1) Add missing benchmark name in default config;
2) Avoid using reserved character ':' in metric name.

024a870b
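One way to keep ':' out of metric names, as #175 requires, is to sanitize each name component before joining. The helper below is a hypothetical sketch, and the sample device path and pattern name are made up; the benchmark's actual naming scheme may differ.

```python
import re


def sanitize_metric_name(*parts, sep='_'):
    """Join name parts with `sep`, replacing ':' and whitespace inside each part."""
    cleaned = [re.sub(r'[:\s]+', sep, str(p)).strip(sep) for p in parts]
    return sep.join(p for p in cleaned if p)


# Made-up inputs for illustration.
name = sanitize_metric_name('/dev/nvme0n1', 'rand:read', 'iops')
```

With these inputs `name` is `/dev/nvme0n1_rand_read_iops`, which contains no ':' character.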

Dockerfile: Add dockerfile - Add rocm 4.0 and 4.2 dockerfile with pytorch1.7.0 (#164) · a7f508e4
guoshzhao authored Aug 31, 2021
```
**Description**
Add dockerfile `rocm4.0-pytorch1.7.0.dockerfile` and `rocm4.2-pytorch1.7.0.dockerfile` for `rocm` platform.
```
a7f508e4

Setup: Revision - Revise torch extra_require (#177) · c8357f4e

guoshzhao authored Aug 31, 2021

**Description**
Change the minimum version requirements for superbench:
```
'torch>=1.7.0a0',
'torchvision>=0.8.0a0',
```

c8357f4e

Benchmarks: Code Revision - Revise subprocess invoke (#178) · 8cd264fd
guoshzhao authored Aug 31, 2021
```
**Description**
Wrap the frequently used subprocess invocation in a function.
```
8cd264fd
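A wrapper of the kind #178 describes might look like the sketch below. The name `run_command` and its exact parameters are assumptions made for illustration, not the signature used in the repository.

```python
import subprocess


def run_command(command, timeout=None):
    """Run a shell command, capturing stdout and stderr as text.

    Returns a CompletedProcess; a non-zero exit code does not raise.
    """
    return subprocess.run(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; this spelling also works on Python 3.6
        timeout=timeout,
        check=False,
    )


result = run_command('echo hello')
```

Callers can then inspect `result.returncode`, `result.stdout`, and `result.stderr` in one place instead of repeating the `subprocess.run` arguments at every call site.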

30 Aug, 2021 6 commits

Benchmarks: Add Benchmark - Add GPU SM copy benchmark (#169) · b97197f0
Ziyue Yang authored Aug 30, 2021
```
**Description**
This commit adds gpu_sm_copy benchmark and related tests.
```
b97197f0

Docs: Revision - Revise results contributing rule (#174) · de481cb0

TobeyQin authored Aug 30, 2021

**Description**
Revise results contributing rule.

- Change the results uploading path to the [superbench-results](https://github.com/microsoft/superbench-results) repo.
- Add description of how to get system info by command.

Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>

de481cb0

Docs: Add document for SuperBench YAML config (#158) · 0b74b2aa
Yifan Xiong authored Aug 30, 2021
```
**Description**
Add document for SuperBench YAML config file.
```
0b74b2aa

Benchmarks: Fix Bug - Remove ib device port info in command to fix bug of ib loopback (#173) · 95c9fc95

Yuting Jiang authored Aug 30, 2021

**Description**
Remove IB device port info in command to fix bug of IB loopback.

**Major Revision**
- Remove IB device port info in command to fix bug of IB loopback

95c9fc95

Benchmarks: Add Benchmark - Add gemm flops microbenchmark for amd (#152) · f3d53c3d

Yuting Jiang authored Aug 30, 2021

**Description**
Add gemm flops microbenchmark for amd.

**Major Revision**
- Add gemm flops microbenchmark for amd.
- Add related example and test file.

f3d53c3d

Benchmarks: Code Revision - Extract base class for gemm flops microbenchmark (#165) · b0df66f7

Yuting Jiang authored Aug 30, 2021

**Description**
Extract base class for gemm flops microbenchmark.

**Major Revision**
- extract base class for gemm flops microbenchmark and add related test.
- revise gemm_flops_performance for cuda.

b0df66f7

27 Aug, 2021 4 commits

Benchmarks: Code Revision - Rename kernel_launch_overhead metrics (#171) · 35114bae

guoshzhao authored Aug 28, 2021

**Description**
Rename `kernel_launch_overhead_event` to `event_overhead`, `kernel_launch_overhead_wall` to `wall_overhead`.

35114bae

Benchmarks: Add Benchmark - Add memory bus bandwidth performance microbenchmark for amd (#153) · 666e3a94

Yuting Jiang authored Aug 27, 2021

**Description**
Add memory bus bandwidth performance microbenchmark for amd.

**Major Revision**
- Add memory bus bandwidth performance microbenchmark for amd.
- Add related example and test file.

666e3a94

Benchmarks: Add Benchmark - Add GPU SM copy benchmark (#162) · 2880f71e
Ziyue Yang authored Aug 27, 2021
```
**Description**
This commit adds the benchmark program for GPU-initiated data transfer benchmark.
```
2880f71e

Benchmarks: Fix Bug - fix bug of microbenchmark building cublas and cudnn for... · 958ebc0e

Yuting Jiang authored Aug 27, 2021

Benchmarks: Fix Bug - fix bug of microbenchmark building cublas and cudnn for amd in build pipeline (#166)

**Description**
Fix bug of microbenchmark building cublas and cudnn for amd.

**Major Revision**
- remove cuda LANGUAGES in project()
- check for CUDAToolkit quietly and build only if it is found

958ebc0e

26 Aug, 2021 1 commit

Benchmarks: Code Revision - Rename computation_communication_overlap microbenchmark metric (#167) · 34cd2e8c

Yuting Jiang authored Aug 26, 2021

**Description**
Rename computation_communication_overlap microbenchmark metric.

**Major Revision**
- remove rank info in metric.
- simplify and rename metric.

34cd2e8c

25 Aug, 2021 1 commit

Benchmarks: Code Revision - Extract base class for memory bandwidth microbenchmark (#159) · e5e84a2e

Yuting Jiang authored Aug 26, 2021

**Description**
Extract base class for memory bandwidth microbenchmark.

**Major Revision**
- revise and optimize cuda_memory_bandwidth_performance
- extract base class for memory bandwidth microbenchmark
- add test for base class

e5e84a2e

23 Aug, 2021 1 commit

Benchmarks: Code Revision - fix typo in test of nccl microbenchmark. (#163) · 0583862d
Yuting Jiang authored Aug 23, 2021
```
**Description**
Fix typo in test_nccl_bw_performance.py.

**Major Revision**
- Fix typo in test_nccl_bw_performance.py.
```
0583862d

22 Aug, 2021 1 commit

Benchmarks: Revise Benchmark - Add readwrite I/O pattern (#161) · 6774d7b7
Ziyue Yang authored Aug 22, 2021
```
**Description**
This commit adds readwrite I/O pattern for FIO benchmark. Read/write ratio is fixed at 4:1.
```
  6774d7b7
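A fixed 4:1 read/write ratio, as stated in #161, corresponds to 80% reads. The sketch below shows how that could be expressed with fio's `--rw` and `--rwmixread` options; whether the benchmark sets the ratio this way is an assumption, and the job name, file path, and helper names are hypothetical.

```python
def rwmixread_from_ratio(read_parts, write_parts):
    """Convert a read:write ratio into the read percentage fio expects."""
    return round(100 * read_parts / (read_parts + write_parts))


def fio_readwrite_args(filename, read_parts=4, write_parts=1):
    """Build fio arguments for a mixed sequential read/write job."""
    return [
        'fio', '--name=readwrite_example',
        f'--filename={filename}',
        '--rw=rw',  # mixed sequential reads and writes
        f'--rwmixread={rwmixread_from_ratio(read_parts, write_parts)}',
    ]


args = fio_readwrite_args('/tmp/fio_test_file')
```

For the default 4:1 ratio the list contains `--rwmixread=80`, so fio issues roughly four reads for every write.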
20 Aug, 2021 2 commits

Runner: Add Feature - Generate summarized output files. (#157) · 7595d794

guoshzhao authored Aug 20, 2021

**Description**
Generate the summarized output files from all nodes. For each metric, apply the reduce operation specified by its `reduce_op`.

**Major Revision**
- Generate the summarized json file per node (`[]` means optional):
  - For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
  - For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
```
{
  "kernel-launch/overhead_event:0": 0.00583,
  "kernel-launch/overhead_event:1": 0.00545,
  "kernel-launch/overhead_event:2": 0.00581,
  "kernel-launch/overhead_event:3": 0.00572,
  "kernel-launch/overhead_event:4": 0.00559,
  "kernel-launch/overhead_event:5": 0.00591,
  "kernel-launch/overhead_event:6": 0.00562,
  "kernel-launch/overhead_event:7": 0.00586,
  "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
  "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
  "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
  "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
  "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
  "pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes; each line is the result from one node in JSON format.

7595d794
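The microbenchmark key format and the per-metric reduction from #157 can be illustrated with a short sketch. The `REDUCE_FUNCS` mapping and both function names are hypothetical; the commit does not list the supported `reduce_op` values, and the runner's real implementation may differ.

```python
import json
import statistics

# Hypothetical mapping from reduce_op names to functions, for illustration only.
REDUCE_FUNCS = {'min': min, 'max': max, 'avg': statistics.mean}


def metric_key(benchmark, metric, run_count=None, rank=None):
    """Format `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`."""
    key = benchmark + '/'
    if run_count is not None:
        key += f'{run_count}/'
    key += metric
    if rank is not None:
        key += f':{rank}'
    return key


def summarize(benchmark, metric, values_by_rank, reduce_op=None):
    """Flatten per-rank values; reduce them to a single value if reduce_op is set."""
    if reduce_op is not None:
        return {metric_key(benchmark, metric): REDUCE_FUNCS[reduce_op](values_by_rank)}
    return {metric_key(benchmark, metric, rank=r): v for r, v in enumerate(values_by_rank)}


node_summary = summarize('kernel-launch', 'overhead_event', [0.00583, 0.00545])
reduced = summarize('kernel-launch', 'overhead_event', [0.00583, 0.00545], reduce_op='max')
jsonl_line = json.dumps(node_summary)  # one such line per node in the jsonl file
```

Without a `reduce_op`, each rank keeps its own `:rank` suffix as in the sample output above; with one, the ranks collapse into a single key.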

Benchmarks: Build Pipeline - Add build logic of hipBusBandwidth in third_party (#151) · a1e5c90d

Yuting Jiang authored Aug 20, 2021

**Description**
Add build logic of hipBusBandwidth in third_party.

**Major Revision**
- Add build logic of hipBusBandwidth in third_party

a1e5c90d

19 Aug, 2021 1 commit

Runner - Support mpi mode (#146) · 98b6c0e3

Yifan Xiong authored Aug 19, 2021



Support mpi mode in runner:
* concatenate the mpirun command
* support mca and env config
* prepare hostfile and update Ansible host pattern

Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>

98b6c0e3
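Concatenating an `mpirun` command from mca and env settings, as #146 outlines, might look like the sketch below. It uses Open MPI's `-hostfile`, `-np`, `-mca`, and `-x` options; the function name, the config values, and the `./benchmark_binary` placeholder are assumptions, not the runner's actual code.

```python
def build_mpirun_command(command, hostfile, proc_num, mca=None, env=None):
    """Concatenate an Open MPI `mpirun` command line from config dicts."""
    parts = ['mpirun', '-hostfile', hostfile, '-np', str(proc_num)]
    for key, value in (mca or {}).items():
        parts += ['-mca', key, str(value)]
    for key, value in (env or {}).items():
        # `-x NAME=value` exports the variable to all launched processes.
        parts += ['-x', f'{key}={value}']
    return ' '.join(parts + [command])


# Illustrative values only.
cmd = build_mpirun_command(
    './benchmark_binary', 'hostfile', 16,
    mca={'pml': 'ob1', 'btl_tcp_if_exclude': 'lo,docker0'},
    env={'NCCL_DEBUG': 'INFO'},
)
```

The sketch does no shell quoting, so values containing spaces or shell metacharacters would need escaping (for example with `shlex.quote`) before being joined.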

16 Aug, 2021 2 commits

Docs - Add config and docs for development experience (#155) · 96fc4d09

Yifan Xiong authored Aug 16, 2021

Add config and docs for development experience.

__Major Revision__
- Add settings and extensions config for VSCode.
- Add devcontainer config for Codespaces.
- Update document accordingly.

96fc4d09

Benchmarks: Code Revision - change 'reduce' to 'reduce_op' (#156) · 7293e783
guoshzhao authored Aug 16, 2021
```
**Description**
Change the field name `reduce` to `reduce_op`.
```
7293e783

12 Aug, 2021 1 commit

Docs - Add docs for Docker container and image (#154) · 783c9125
Yifan Xiong authored Aug 12, 2021
```
Add docs on:
* Docker image tag list
* Build image and run container instructions
```
783c9125

09 Aug, 2021 1 commit

Benchmarks: Doc Revision - Add ReduceType into benchmarks doc. (#150) · d23ad898
guoshzhao authored Aug 09, 2021
```
Add ReduceType description into benchmarks doc.
```
d23ad898

06 Aug, 2021 2 commits

Benchmarks: Add Feature - Set reduce type for current benchmarks' metrics. (#149) · acf365a8
guoshzhao authored Aug 06, 2021
```
**Description**
Set reduce type for current benchmarks' metrics, including model benchmarks and ShardingMatmul.
```
acf365a8

Benchmarks: Code Revision - Calculate average value by using statistics module. (#148) · bc1a61b9
guoshzhao authored Aug 06, 2021
```
**Description**
Replace `sum(results) / len(results)` with `statistics.mean(results)`
```
  bc1a61b9
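The replacement made in #148 can be checked with a short example. The sample values are illustrative, and the note on empty input describes the standard library's behaviour rather than anything stated in the commit.

```python
import statistics

results = [0.00583, 0.00545, 0.00581, 0.00572]

manual_mean = sum(results) / len(results)
module_mean = statistics.mean(results)

# Unlike the manual formula, statistics.mean raises a descriptive
# StatisticsError on empty input instead of ZeroDivisionError.
try:
    statistics.mean([])
except statistics.StatisticsError as err:
    empty_error = str(err)
```

Both expressions agree on non-empty input up to floating-point rounding; the module version mainly improves readability and error reporting.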