Commits · 15f22e2cd255084b4cf6dcf4d791a582c2faf2fd · tsoc / superbenchmark

24 Sep, 2021 1 commit

Docs - Upgrade version and release note (#209) · 15f22e2c

Yifan Xiong authored Sep 24, 2021

__Description__

Upgrade version and release note. Closes #95 and #170.

__Major Revisions__

* Upgrade package versions
* Add release note for v0.3.0

15f22e2c

23 Sep, 2021 1 commit

Benchmarks: Update - Update benchmarks in configuration file (#208) · a58f218b

Yuting Jiang authored Sep 23, 2021

**Description**
Update benchmarks in configuration files for single node validation of superbench v0.3.

**Major Revision**
- fix bugs of parameters in nccl-bw for single node validation in configs
- update new benchmarks in amd_mi100_hpe.yaml, amd_mi100_z53.yaml, azure_ndv4.yaml
- fix bug of wrong gpu visible prefix

a58f218b

18 Sep, 2021 3 commits

Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203) · 2f0f6541
Ziyue Yang authored Sep 18, 2021
```
**Description**
This commit fixes wrong parameters for gpu-sm-copy-bw call in configuration examples.
```
2f0f6541
Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204) · e80447f7
Yuting Jiang authored Sep 18, 2021
```
**Description**
fix bug in error message of communication-computation-overlap.

**Major Revision**
- remove non existing variable
```
e80447f7

Tool: Fix bug - Fix function naming issue in system info (#200) · 2d85781b

Yuting Jiang authored Sep 18, 2021

**Description**
Fix function naming issue in system info.

**Major Revision**
- fix function naming issue in system info 
- save to json file
- add timeout for subprocess.run
- revise error handling to print exception message
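
The timeout and error-handling revisions listed above could look roughly like the sketch below. The helper name `run_command` and the 10-second default are assumptions made for illustration; they are not taken from the repository.

```python
import subprocess
import sys


def run_command(command, timeout=10):
    """Run a command (hypothetical helper) and return its stdout, or '' on failure."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,    # give up instead of hanging forever
            check=True,         # raise on a non-zero exit code
        )
        return result.stdout
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError) as e:
        # Print the exception message instead of failing silently.
        print('Error: {}'.format(e), file=sys.stderr)
        return ''
```

Returning an empty string on failure lets a system-info collector keep going when a single tool is missing or slow on a node.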

2d85781b

17 Sep, 2021 1 commit
- Bug - Fix torch.distributed command for single node (#201) · 890ce65d
  Yifan Xiong authored Sep 17, 2021
```
Fix `torch.distributed` command for single node.
```
  890ce65d
16 Sep, 2021 1 commit
- CLI - Integrate system info for node (#199) · f91f97b6
  Yifan Xiong authored Sep 16, 2021
```
Integrate system info for node, add `sb node info` command.
```
  f91f97b6
14 Sep, 2021 1 commit

Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196) · ff487387

guoshzhao authored Sep 14, 2021

**Description**
1. Do `enable_language(CUDA)` before using `CMAKE_CUDA_COMPILER_VERSION`
2. Use `cmake --install` to install the target, which calls `cmake -P cmake_install.cmake`, instead of `make Makefile`, to avoid the error `make: *** No rule to make target 'install'.  Stop.`

ff487387

13 Sep, 2021 2 commits

Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198) · 7a3a4502

Yuting Jiang authored Sep 13, 2021

**Description**
Add a barrier before 'destroy_process_group' to fix a bug that occurs when one model benchmark runs multiple models: some processes have not yet finished with the previous process group while others fail to initialize the new process group for the next model. The bug was observed on rocm4.x when running bert_models.

**Major Revision**
-  Add barrier before 'destroy_process_group'.
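
A minimal sketch of the teardown order described above. The function name `teardown_process_group` is hypothetical, and `dist` is assumed to be the `torch.distributed` module; it is passed in as a parameter so the sketch itself has no hard dependency on PyTorch.

```python
def teardown_process_group(dist):
    """Synchronize all ranks, then destroy the default process group.

    `dist` is expected to be the `torch.distributed` module (assumption).
    Returns False when no process group was initialized.
    """
    if not dist.is_initialized():
        return False
    # Wait until every rank has finished with the current process group ...
    dist.barrier()
    # ... so no rank starts initializing the next group while others lag behind.
    dist.destroy_process_group()
    return True
```

With PyTorch installed, this would be called as `teardown_process_group(torch.distributed)` after each model finishes.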

7a3a4502

Bug - Revise 'docker run' in sb deploy (#195) · 1f9de77f

Yuting Jiang authored Sep 13, 2021

**Description**

Revise 'docker run' in sb deploy because the base image runs its entrypoint/cmd under /root.

**Major Revision**

- define bash as the entrypoint in 'docker run'

1f9de77f

09 Sep, 2021 1 commit
- Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190) · 14232b56
  Yuting Jiang authored Sep 09, 2021
```
**Description**
Fix the wrong parameter name `operations` of rccl-bw in the HPE MI100 config.

**Major Revision**
- operations->operation
```
  14232b56
06 Sep, 2021 1 commit

Tools: Add Feature - Add script to generate system config info. (#160) · 37b15db9

Yuting Jiang authored Sep 06, 2021

**Description**
Add script to generate system config info.

**Major Revision**
- Add script to generate system config info into the dict in superbench/tools.

37b15db9

03 Sep, 2021 1 commit

Benchmarks: Code Revision - Revise arguments of nccl/rccl to support mpi mode... · 60762518

Yuting Jiang authored Sep 03, 2021

Benchmarks: Code Revision - Revise arguments of nccl/rccl to support mpi mode and rename metric (#189)

**Description**
Revise arguments of nccl/rccl to support mpi mode (nccl/rccl could not run in mpi mode because multiple operators ran in sequence without a barrier) and rename metric.

**Major Revision**
- revise argument operators to be a single one

**Minor Revision**
- rename metric to remove benchmark name info
- change argument ngpus default value to be 1

60762518

02 Sep, 2021 3 commits

Benchmarks: Fix bug - Fix missing key error in disk performance benchmark (#188) · b79e2845
Ziyue Yang authored Sep 02, 2021
```
**Description**
This commit fixes error of missing key 'percentile' in parsing FIO result.
```
b79e2845

Benchmarks: Add Configuration - Add microbenchmark in the validation config... · 47daedbe

Yuting Jiang authored Sep 02, 2021

Benchmarks: Add Configuration - Add microbenchmark in the validation config file for HPE (AMD MI100) (#176)

**Description**
Add microbenchmark in the validation config file for AMD MI100.

**Major Revision**
- add rccl-bw, mem-bw, ib-loopback, gemm-flops, kernel-launch config for mi100

47daedbe

Runner - Fix inventory issue in ansible_runner (#185) · e2453e1c

Yifan Xiong authored Sep 02, 2021

__Description__

Fix inventory bug in ansible_runner when host list is provided with multiple hosts.

This ought to be handled by the ansible_runner library; as a workaround, pass the `--inventory` argument on the command line.

e2453e1c

01 Sep, 2021 2 commits

Benchmarks: Code Revision - revise the DockerBenchmark base class (#179) · 37d5dfd5

guoshzhao authored Sep 01, 2021

**Description**
Revise the DockerBenchmark base to support image pull, image rm etc.

**Major Revision**
- image pull in _preprocess()
- image clean in _postprocess()
- execute customized commands in _benchmark()
- add unit tests
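
The three phases listed above can be sketched as follows. `DockerBenchmarkSketch` is a hypothetical stand-in and not the project's `DockerBenchmark` class; the injectable `runner` is an assumption added so the flow can be exercised without a docker daemon.

```python
import subprocess


class DockerBenchmarkSketch:
    """Hypothetical stand-in for a docker-based benchmark; not the project's class."""

    def __init__(self, image, commands, runner=None):
        self._image = image          # docker image to benchmark with
        self._commands = commands    # customized commands supplied by a subclass
        # The runner takes a shell command and returns its exit code.
        self._runner = runner or (lambda cmd: subprocess.run(cmd, shell=True).returncode)

    def _preprocess(self):
        # Pull the image before the benchmark starts.
        return self._runner('docker pull {}'.format(self._image)) == 0

    def _benchmark(self):
        # Execute every customized command; stop at the first failure.
        return all(self._runner(cmd) == 0 for cmd in self._commands)

    def _postprocess(self):
        # Remove the image afterwards to free disk space.
        return self._runner('docker rmi {}'.format(self._image)) == 0

    def run(self):
        ok = self._preprocess() and self._benchmark()
        # Clean up even when an earlier phase failed.
        return self._postprocess() and ok
```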

37d5dfd5

Benchmarks: Docker Benchmarks - Setup Docker-in-Docker environment (#180) · 7d947757

guoshzhao authored Sep 01, 2021

**Description**
Setup docker environment in docker container.

**Major Revision**
- Install docker client for cuda and rocm images.
- Mount /var/run/docker.sock from host

7d947757

31 Aug, 2021 2 commits

Benchmarks: Code Revision - Revise metric name generation and default config... · 024a870b

Ziyue Yang authored Aug 31, 2021

Benchmarks: Code Revision - Revise metric name generation and default config for disk performance benchmark (#175)

**Description**
This commit revises disk performance benchmark, including:
1) Add missing benchmark name in default config;
2) Avoid using reserved character ':' in metric name.
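
One way to keep ':' out of generated metric names is sketched below. The helper name `make_metric_name`, the '_' separator, and the '-' replacement are all assumptions; the commit does not say which replacement it uses.

```python
def make_metric_name(*parts, reserved=':', replacement='-'):
    """Join name parts with '_' after replacing the reserved character in each part."""
    return '_'.join(str(part).replace(reserved, replacement) for part in parts)
```

For example, `make_metric_name('nvme0n1', 'read:iops')` yields `nvme0n1_read-iops`, which contains no ':'.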

024a870b

Benchmarks: Code Revision - Revise subprocess invoke (#178) · 8cd264fd
guoshzhao authored Aug 31, 2021
```
**Description**
Package frequently-used subprocess invoke into function.
```
8cd264fd

30 Aug, 2021 4 commits

Benchmarks: Add Benchmark - Add GPU SM copy benchmark (#169) · b97197f0
Ziyue Yang authored Aug 30, 2021
```
**Description**
This commit adds gpu_sm_copy benchmark and related tests.
```
b97197f0

Benchmarks: Fix Bug - Remove ib device port info in command to fix bug of ib loopback (#173) · 95c9fc95

Yuting Jiang authored Aug 30, 2021

**Description**
Remove IB device port info in command to fix bug of IB loopback.

**Major Revision**
- Remove IB device port info in command to fix bug of IB loopback

95c9fc95

Benchmarks: Add Benchmark - Add gemm flops microbenchmark for amd (#152) · f3d53c3d

Yuting Jiang authored Aug 30, 2021

**Description**
Add gemm flops microbenchmark for amd.

**Major Revision**
- Add gemm flops microbenchmark for amd.
- Add related example and test file.

f3d53c3d

Benchmarks: Code Revision - Extract base class for gemm flops microbenchmark (#165) · b0df66f7

Yuting Jiang authored Aug 30, 2021

**Description**
Extract base class for gemm flops microbenchmark.

**Major Revision**
- extract base class for gemm flops microbenchmark and add related test.
- revise gemm_flops_performance for cuda.

b0df66f7

27 Aug, 2021 4 commits

Benchmarks: Code Revision - Rename kernel_launch_overhead metrics (#171) · 35114bae

guoshzhao authored Aug 28, 2021

**Description**
Rename `kernel_launch_overhead_event` to `event_overhead`, `kernel_launch_overhead_wall` to `wall_overhead`.

35114bae

Benchmarks: Add Benchmark - Add memory bus bandwidth performance microbenchmark for amd (#153) · 666e3a94

Yuting Jiang authored Aug 27, 2021

**Description**
Add memory bus bandwidth performance microbenchmark for amd.

**Major Revision**
- Add memory bus bandwidth performance microbenchmark for amd.
- Add related example and test file.

666e3a94

Benchmarks: Add Benchmark - Add GPU SM copy benchmark (#162) · 2880f71e
Ziyue Yang authored Aug 27, 2021
```
**Description**
This commit adds the benchmark program for GPU-initiated data transfer benchmark.
```
2880f71e

Benchmarks: Fix Bug - fix bug of microbenchmark building cublas and cudnn for... · 958ebc0e

Yuting Jiang authored Aug 27, 2021

Benchmarks: Fix Bug - fix bug of microbenchmark building cublas and cudnn for amd in build pipeline (#166)

**Description**
Fix bug of microbenchmark building cublas and cudnn for amd.

**Major Revision**
- remove cuda LANGUAGES in project()
- check CUDAToolkit quiet and then build if found

958ebc0e

26 Aug, 2021 1 commit

Benchmarks: Code Revision - Rename computation_communication_overlap microbenchmark metric (#167) · 34cd2e8c

Yuting Jiang authored Aug 26, 2021

**Description**
Rename computation_communication_overlap microbenchmark metric.

**Major Revision**
- remove rank info in metric.
- simplify and rename metric.

34cd2e8c

25 Aug, 2021 1 commit

Benchmarks: Code Revision - Extract base class for memory bandwidth microbenchmark (#159) · e5e84a2e

Yuting Jiang authored Aug 26, 2021

**Description**
extract base class for memory bandwidth microbenchmark.

**Major Revision**
- revise and optimize cuda_memory_bandwidth_performance
- extract base class for memory bandwidth microbenchmark
- add test for base class

e5e84a2e

22 Aug, 2021 1 commit
- Benchmarks: Revise Benchmark - Add readwrite I/O pattern (#161) · 6774d7b7
  Ziyue Yang authored Aug 22, 2021
```
**Description**
This commit adds readwrite I/O pattern for FIO benchmark. Read/write ratio is fixed at 4:1.
```
  6774d7b7
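
A 4:1 read/write ratio corresponds to 80% reads. In fio, `rw=readwrite` selects mixed sequential reads and writes and `rwmixread` sets the percentage of reads. The sketch below assumes the benchmark passes the ratio through those two options; the helper name `fio_rw_options` is hypothetical.

```python
def fio_rw_options(read_parts=4, write_parts=1):
    """Return fio options for a mixed read/write job with the given ratio."""
    read_percent = 100 * read_parts // (read_parts + write_parts)
    return ['--rw=readwrite', '--rwmixread={}'.format(read_percent)]
```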
20 Aug, 2021 1 commit

Runner: Add Feature - Generate summarized output files. (#157) · 7595d794

guoshzhao authored Aug 20, 2021

**Description**
Generate the summarized output files from all nodes. For each metric, apply the reduce operation specified by its `reduce_op`.

**Major Revision**
- Generate the summarized json file per node:
For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`
For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`
`[]` means optional.
```
{
  "kernel-launch/overhead_event:0": 0.00583,
  "kernel-launch/overhead_event:1": 0.00545,
  "kernel-launch/overhead_event:2": 0.00581,
  "kernel-launch/overhead_event:3": 0.00572,
  "kernel-launch/overhead_event:4": 0.00559,
  "kernel-launch/overhead_event:5": 0.00591,
  "kernel-launch/overhead_event:6": 0.00562,
  "kernel-launch/overhead_event:7": 0.00586,
  "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134,
  "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773,
  "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677,
  "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973,
  "pytorch-sharding-matmul/0/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/1/allreduce": 10.561786651611328,
  "pytorch-sharding-matmul/0/allgather": 10.088025093078613,
  "pytorch-sharding-matmul/1/allgather": 10.088025093078613
}
```
- Generate the summarized jsonl file for all nodes, each line is the result from one node in json format.
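
The two key formats and the jsonl layout described above can be sketched as follows. The function names are hypothetical; only the key formats come from the commit message.

```python
import json


def micro_metric_key(benchmark, metric, run_count=None, rank=None):
    """Build `{benchmark_name}/[{run_count}/]{metric_name}[:rank]`."""
    parts = [benchmark]
    if run_count is not None:
        parts.append(str(run_count))
    parts.append(metric)
    key = '/'.join(parts)
    return key if rank is None else '{}:{}'.format(key, rank)


def model_metric_key(benchmark, sub_benchmark, metric, run_count=None):
    """Build `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}`."""
    parts = [benchmark, sub_benchmark]
    if run_count is not None:
        parts.append(str(run_count))
    parts.append(metric)
    return '/'.join(parts)


def to_jsonl(per_node_summaries):
    """Serialize one summary dict per node as one JSON object per line."""
    return '\n'.join(json.dumps(summary) for summary in per_node_summaries)
```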

7595d794

19 Aug, 2021 1 commit

Runner - Support mpi mode (#146) · 98b6c0e3

Yifan Xiong authored Aug 19, 2021



Support mpi mode in runner:
* concatenate mpirun command
* support mca and env config
* prepare hostfile and update Ansible host pattern
Co-authored-by: Peng Cheng <chengpeng5555@outlook.com>
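
The steps above can be sketched as follows, assuming Open MPI's `mpirun` with its `--hostfile`, `-mca`, and `-x` options and its `host slots=N` hostfile format. The function names and the example values are hypothetical and do not reflect the runner's actual code.

```python
def hostfile_lines(hosts, slots):
    """Return Open MPI hostfile lines, one `host slots=N` entry per node."""
    return ['{} slots={}'.format(host, slots) for host in hosts]


def build_mpirun_command(hostfile, command, mca=None, env=None):
    """Concatenate an Open MPI style mpirun invocation as an argument list."""
    args = ['mpirun', '--hostfile', hostfile]
    # MCA parameters are passed as `-mca <key> <value>`.
    for key, value in sorted((mca or {}).items()):
        args += ['-mca', key, str(value)]
    # Environment variables are exported to all ranks with `-x NAME=value`.
    for name, value in sorted((env or {}).items()):
        args += ['-x', '{}={}'.format(name, value)]
    return args + list(command)
```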

98b6c0e3

16 Aug, 2021 1 commit
- Benchmarks: Code Revision - change 'reduce' to 'reduce_op' (#156) · 7293e783
  guoshzhao authored Aug 16, 2021
```
**Description**
Change the field name `reduce` to `reduce_op`.
```
  7293e783
06 Aug, 2021 2 commits
- Benchmarks: Add Feature - Set reduce type for current benchmarks' metrics. (#149) · acf365a8
  guoshzhao authored Aug 06, 2021
```
**Description**
Set reduce type for current benchmarks' metrics, including model benchmarks and ShardingMatmul.
```
  acf365a8
- Benchmarks: Code Revision - Calculate average value by using statistics module. (#148) · bc1a61b9
  guoshzhao authored Aug 06, 2021
```
**Description**
Replace `sum(results) / len(results)` with `statistics.mean(results)`
```
  bc1a61b9
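
A small illustration of the replacement, with a hypothetical wrapper named `average`. One behavioral difference worth knowing: on empty input, `statistics.mean` raises `statistics.StatisticsError`, whereas `sum(results) / len(results)` raises `ZeroDivisionError`.

```python
import statistics


def average(results):
    """Average a non-empty sequence of numbers with the standard library."""
    return statistics.mean(results)
```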
05 Aug, 2021 1 commit

Benchmarks: Add Feature - Add reduce function support for output summary. (#147) · e41b1f62

guoshzhao authored Aug 05, 2021

**Description**
Add reduce function support for output summary.

**Major Revision**
- Add reducer class to maintain all reduce functions.
- Save reduce type of each metric into `BenchmarkResult`
- Fix UT.
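
A registry of reduce functions might look like the sketch below. The class name `ReducerSketch` and the reduce type names (`min`, `max`, `sum`, `avg`) are assumptions; the commit message does not list the supported types.

```python
import statistics


class ReducerSketch:
    """Hypothetical registry mapping a reduce type name to a reduce function."""

    functions = {
        'min': min,
        'max': max,
        'sum': sum,
        'avg': statistics.mean,
    }

    @classmethod
    def reduce(cls, reduce_type, values):
        """Apply the function registered for `reduce_type` to `values`."""
        if reduce_type not in cls.functions:
            raise ValueError('Unknown reduce type: {}'.format(reduce_type))
        return cls.functions[reduce_type](values)
```

Storing the reduce type alongside each metric lets a summarizer look up the right function later without knowing which benchmark produced the value.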

e41b1f62

30 Jul, 2021 1 commit
- Benchmarks: Add Benchmark - Revise and add rccl microbenchmark for rocm (#143) · 157b4e2d
  Yuting Jiang authored Jul 30, 2021
```
**Description**
Add rccl bandwidth microbenchmark for rocm.

**Major Revision**
- Register rccl-bw benchmark.
```
  157b4e2d
29 Jul, 2021 1 commit

Release - SuperBench v0.2.1 (#142) · 69b2c631

Yifan Xiong authored Jul 29, 2021

__Description__
Cherry-pick bug fixes from v0.2.1 to main.

__Major Revisions__
* Fix bug where VGG models failed on A100 GPU with batch_size=128.
* Fix Ansible connection issue when running in localhost.
* Update version in packages and docs.

69b2c631

27 Jul, 2021 1 commit

Benchmarks: Add Benchmark - Add the source code of rocm kernel launch overhead benchmark. (#136) · 1ee8f7dc

Yuting Jiang authored Jul 27, 2021

**Description**
Add the source code of rocm kernel launch overhead benchmark. 

**Major Revision**
- Revise cmake build logic to support both cuda and rocm

1ee8f7dc