- 10 Dec, 2021 1 commit
-
-
guoshzhao authored
**Description** Integrate monitor into Superbench. **Major Revision** - Initialize, start and stop monitor in SB executor. - Parse the monitor data in SB runner and merge into benchmark results. - Specify ReduceType for monitor metrics, such as MAX, MIN and LAST. - Add monitor configs into config file.
-
- 02 Dec, 2021 1 commit
-
-
Yifan Xiong authored
**Description** Replace `-c` argument with `-N` for `numactl` since the old `-c`/`--cpubind` argument is deprecated.
-
- 26 Sep, 2021 1 commit
-
-
Yifan Xiong authored
**Description** Cherry-pick bug fixes from v0.3.0 to main. **Major Revisions** * Docs - Upgrade version and release note (#209) * Benchmarks: Build Pipeline - Update rccl-test git submodule to dc1ad48 (#210) * Benchmarks: Update - Update benchmarks in configuration file (#208) * CI/CD - Update GitHub Action VM (#211) * Benchmarks: Fix Bug - Fix wrong parameters for gpu-sm-copy-bw in configuration examples (#203) * CI/CD - Fix bug in build image for push event (#205) * Benchmark: Fix Bug - fix error message of communication-computation-overlap (#204) * Tool: Fix bug - Fix function naming issue in system info (#200) * CI/CD - Push images in GitHub Action (#202) * Bug - Fix torch.distributed command for single node (#201) * CLI - Integrate system info for node (#199) * Benchmarks: Code Revision - Revise CMake files for microbenchmarks. (#196) * CI/CD - Add ROCm image build in GitHub Actions (#194) * Bug: Fix bug - fix bug of hipBusBandwidth build (#193) * Benchmarks: Build Pipeline - Restore rocblas build logic (#197) * Bug: Fix Bug - Add barrier before 'destroy_process_group' in model benchmarks (#198) * Bug - Revise 'docker run' in sb deploy (#195) * Bug - Fix Bug : fix bug of error param operations to operation in rccl-bw of hpe config (#190) Co-authored-by:
Yuting Jiang <v-yujiang@microsoft.com> Co-authored-by:
Guoshuai Zhao <guzhao@microsoft.com> Co-authored-by:
Ziyue Yang <ziyyang@microsoft.com>
-
- 20 Aug, 2021 1 commit
-
-
guoshzhao authored
**Description** Generate the summarized output files from all nodes. For each metric, do the reduce operation according to the `reduce_op` **Major Revision** - Generate the summarized json file per node: For microbenchmark, the format is `{benchmark_name}/[{run_count}/]{metric_name}[:rank]` For modelbenchmark, the format is `{benchmark_name}/{sub_benchmark_name}/[{run_count}/]{metric_name}` `[]` means optional. ``` { "kernel-launch/overhead_event:0": 0.00583, "kernel-launch/overhead_event:1": 0.00545, "kernel-launch/overhead_event:2": 0.00581, "kernel-launch/overhead_event:3": 0.00572, "kernel-launch/overhead_event:4": 0.00559, "kernel-launch/overhead_event:5": 0.00591, "kernel-launch/overhead_event:6": 0.00562, "kernel-launch/overhead_event:7": 0.00586, "resnet_models/pytorch-resnet50/steptime-train-float32": 544.0827468410134, "resnet_models/pytorch-resnet50/throughput-train-float32": 353.7607016465773, "resnet_models/pytorch-resnet50/steptime-train-float16": 425.40482617914677, "resnet_models/pytorch-resnet50/throughput-train-float16": 454.0142363793973, "pytorch-sharding-matmul/0/allreduce": 10.561786651611328, "pytorch-sharding-matmul/1/allreduce": 10.561786651611328, "pytorch-sharding-matmul/0/allgather": 10.088025093078613, "pytorch-sharding-matmul/1/allgather": 10.088025093078613 } ``` - Generate the summarized jsonl file for all nodes, each line is the result from one node in json format.
-
- 19 Aug, 2021 1 commit
-
-
Yifan Xiong authored
Support mpi mode in runner: * concate mpirun command * support mca and env config * prepare hostfile and update Ansible host pattern Co-authored-by:Peng Cheng <chengpeng5555@outlook.com>
-
- 02 Jul, 2021 1 commit
-
-
Yifan Xiong authored
Fetch benchmarks results on all nodes, will rsync after each benchmark. The results directory structure on control node is as follows: ``` outputs/ └── datetime ├── nodes │ └── node-0 │ ├── benchmarks │ │ ├── benchmark-0 │ │ │ ├── rank-0 │ │ │ │ └── results.json │ └── sb-exec.log ├── sb-run.log └── sb.config.yaml ```
-
- 01 Jul, 2021 1 commit
-
-
Yifan Xiong authored
* Support custom output directory. * Update document.
-
- 23 Jun, 2021 1 commit
-
-
Yifan Xiong authored
* Add `sb deploy` command content. * Fix inline if-expression syntax in playbook. * Fix quote escape issue in bash command. * Add custom env in config. * Update default config for multi GPU benchmarks. * Update MANIFEST.in to include jinja2 template. * Require jinja2 minimum version. * Fix occasional duplicate output in Ansible runner. * Fix mixed color from Ansible and Python colorlog. * Update according to comments. * Change superbench.env from list to dict in config file.
-
- 02 Jun, 2021 1 commit
-
-
Yifan Xiong authored
* Support local mode in runner.
-
- 28 May, 2021 1 commit
-
-
Yifan Xiong authored
* Support `torch.distributed` mode in runner. * Support given `proc_num` and `node_num` in `torch.distributed` mode.
-
- 12 Apr, 2021 1 commit
-
-
Yifan Xiong authored
* init runner class with not implemented
-