Unverified commit b07fda15, authored by Yifan Xiong and committed by GitHub

Release - SuperBench v0.7.0 (#468)



**Description**

Cherry-pick bug fixes from v0.7.0 to main.

**Major Revisions**

* Benchmarks - Fix missing include in FP8 benchmark (#460)
* Fix bug in TE BERT model (#461)
* Doc - Update benchmark doc (#465)
* Bug - Fix incorrect datatype judgement in cublas-function source code (#464)
* Support `sb deploy` without pulling image (#466)
* Docs - Upgrade version and release note (#467)
Co-authored-by: Russell J. Hewett <russell.j.hewett@gmail.com>
Co-authored-by: Yuting Jiang <yutingjiang@microsoft.com>
parent f380bc5e
@@ -15,7 +15,7 @@
__SuperBench__ is a validation and profiling tool for AI infrastructure.
-📢 [v0.6.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.6.0) has been released!
+📢 [v0.7.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.7.0) has been released!
## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._
......
@@ -97,6 +97,7 @@ sb deploy [--docker-image]
          [--host-list]
          [--host-password]
          [--host-username]
+         [--no-image-pull]
          [--output-dir]
          [--private-key]
```
@@ -112,6 +113,7 @@ sb deploy [--docker-image]
| `--host-list` `-l` | `None` | Comma separated host list. |
| `--host-password` | `None` | Host password or key passphrase if needed. |
| `--host-username` | `None` | Host username if needed. |
+| `--no-image-pull` | `False` | Skip pull and use local Docker image. |
| `--output-dir` | `None` | Path to output directory, outputs/{datetime} will be used if not specified. |
| `--private-key` | `None` | Path to private key if needed. |
......
@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
:::note Note
You should check out the corresponding tag to use a release version, for example,
-`git clone -b v0.6.0 https://github.com/microsoft/superbenchmark`
+`git clone -b v0.7.0 https://github.com/microsoft/superbenchmark`
:::
```bash
......
@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
:::note Note
You should deploy the corresponding Docker image to use a release version, for example,
-`sb deploy -f local.ini -i superbench/superbench:v0.6.0-cuda11.1.1`
+`sb deploy -f local.ini -i superbench/superbench:v0.7.0-cuda11.1.1`
Note that the version of the git repo only determines the version of the sb CLI, not the sb container; you should specify the container version even if you checked out a release tag when cloning.
......
@@ -70,7 +70,7 @@ superbench:
<TabItem value='example'>
```yaml
-version: v0.6
+version: v0.7
superbench:
  enable: benchmark_1
  monitor:
@@ -471,4 +471,3 @@ Available variables in formatted string include:
+ `ibnetdiscover(str)`: the path of ibnetdiscover output `ibnetdiscover_file.txt`, required in `topo-aware` pattern.
+ `min_dist(int)`: minimum distance of VM pair, required in `topo-aware` pattern.
+ `max_dist(int)`: maximum distance of VM pair, required in `topo-aware` pattern.
@@ -66,9 +66,9 @@ Measure the GEMM performance of [`cublasLtMatmul`](https://docs.nvidia.com/cuda/
#### Metrics
| Name | Unit | Description |
-|---------------------------------|----------------|---------------------------------|
-| cublaslt-gemm/dtype_m_n_k_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
+|------------------------------------------------|----------------|---------------------------------|
+| cublaslt-gemm/${dtype}\_${m}\_${n}\_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
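The metric name in the row above is parameterized by data type and GEMM shape. A minimal sketch of how such a name expands (the `dtype` and shape values below are illustrative, not taken from the source):

```python
# Illustrative expansion of the documented pattern
# cublaslt-gemm/${dtype}_${m}_${n}_${k}_flops; values are made up.
def gemm_metric_name(dtype: str, m: int, n: int, k: int) -> str:
    return f'cublaslt-gemm/{dtype}_{m}_{n}_{k}_flops'

print(gemm_metric_name('fp16', 4096, 4096, 4096))
# → cublaslt-gemm/fp16_4096_4096_4096_flops
```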
### `cublas-function`
@@ -86,9 +86,11 @@ The supported functions for cuBLAS are as follows:
#### Metrics
| Name | Unit | Description |
-|----------------------------------------------------------|-----------|-------------------------------------------------------------------|
-| cublas-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cublas function with the parameters. |
+|-------------------------------------------------------------------|-----------|-------------------------------------------------------------------|
+| cublas-function/name\_${function_name}\_${parameters}_time | time (us) | The mean time to execute the cublas function with the parameters. |
+| cublas-function/name\_${function_name}\_${parameters}_correctness | | Whether the results of the cublas function with the parameters pass the correctness check, if the correctness check is enabled. |
+| cublas-function/name\_${function_name}\_${parameters}_error | | The error ratio of the results of the cublas function with the parameters, if the correctness check is enabled. |
### `cudnn-function`
@@ -103,9 +105,9 @@ The supported functions for cuDNN are as follows:
#### Metrics
| Name | Unit | Description |
-|---------------------------------------------------------|-----------|------------------------------------------------------------------|
-| cudnn-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cudnn function with the parameters. |
+|-----------------------------------------------------------|-----------|------------------------------------------------------------------|
+| cudnn-function/name\_${function_name}\_${parameters}_time | time (us) | The mean time to execute the cudnn function with the parameters. |
### `tensorrt-inference`
@@ -264,9 +266,10 @@ Support the following traffic patterns:
| rccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | RCCL operation algorithm bandwidth with given message size. |
| rccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | RCCL operation bus bandwidth with given message size. |
-If traffic pattern is specified, the metrics pattern will change to `nccl-bw/${operation}_${serial_index}_${parallel_index}:${msg_size}_time`
+If mpi mode is enabled and a traffic pattern is specified, the metric name pattern changes to `nccl-bw/${operation}_${serial_index}_${parallel_index}:${msg_size}_time`
- `serial_index` is the index of the host group in the serial dimension.
- `parallel_index` is the index of the host list in the parallel dimension.
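A small sketch of the traffic-pattern metric name described above, with illustrative operation, index, and message-size values:

```python
# Hypothetical helper mirroring the documented pattern
# nccl-bw/${operation}_${serial_index}_${parallel_index}:${msg_size}_time.
def pattern_metric(operation: str, serial_index: int, parallel_index: int, msg_size: int) -> str:
    return f'nccl-bw/{operation}_{serial_index}_{parallel_index}:{msg_size}_time'

# e.g. the second parallel host list within the first serial host group
print(pattern_metric('allreduce', 0, 1, 8589934592))
# → nccl-bw/allreduce_0_1:8589934592_time
```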
### `tcp-connectivity`
#### Introduction
......
@@ -30,19 +30,15 @@ including the following categories:
For inference, supported percentiles include
50<sup>th</sup>, 90<sup>th</sup>, 95<sup>th</sup>, 99<sup>th</sup>, and 99.9<sup>th</sup>.
+**New: Support fp8_hybrid and fp8_e4m3 precision for BERT models.**
#### Metrics
| Name | Unit | Description |
-|---------------------------------------------------------------------------------|------------------------|---------------------------------------------------------------------------|
-| model-benchmarks/pytorch-${model_name}/fp32_train_step_time | time (ms) | The average training step time with single precision. |
-| model-benchmarks/pytorch-${model_name}/fp32_train_throughput | throughput (samples/s) | The average training throughput with single precision. |
-| model-benchmarks/pytorch-${model_name}/fp32_inference_step_time | time (ms) | The average inference step time with single precision. |
-| model-benchmarks/pytorch-${model_name}/fp32_inference_throughput | throughput (samples/s) | The average inference throughput with single precision. |
-| model-benchmarks/pytorch-${model_name}/fp32_inference_step_time\_${percentile} | time (ms) | The n<sup>th</sup> percentile inference step time with single precision. |
-| model-benchmarks/pytorch-${model_name}/fp32_inference_throughput\_${percentile} | throughput (samples/s) | The n<sup>th</sup> percentile inference throughput with single precision. |
-| model-benchmarks/pytorch-${model_name}/fp16_train_step_time | time (ms) | The average training step time with half precision. |
-| model-benchmarks/pytorch-${model_name}/fp16_train_throughput | throughput (samples/s) | The average training throughput with half precision. |
-| model-benchmarks/pytorch-${model_name}/fp16_inference_step_time | time (ms) | The average inference step time with half precision. |
-| model-benchmarks/pytorch-${model_name}/fp16_inference_throughput | throughput (samples/s) | The average inference throughput with half precision. |
-| model-benchmarks/pytorch-${model_name}/fp16_inference_step_time\_${percentile} | time (ms) | The n<sup>th</sup> percentile inference step time with half precision. |
-| model-benchmarks/pytorch-${model_name}/fp16_inference_throughput\_${percentile} | throughput (samples/s) | The n<sup>th</sup> percentile inference throughput with half precision. |
+|-----------------------------------------------------------------------------------------|------------------------|------------------------------------------------------------------------------|
+| model-benchmarks/pytorch-${model_name}/${precision}_train_step_time | time (ms) | The average training step time with fp32/fp16 precision. |
+| model-benchmarks/pytorch-${model_name}/${precision}_train_throughput | throughput (samples/s) | The average training throughput with fp32/fp16 precision. |
+| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time | time (ms) | The average inference step time with fp32/fp16 precision. |
+| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput | throughput (samples/s) | The average inference throughput with fp32/fp16 precision. |
+| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time\_${percentile} | time (ms) | The n<sup>th</sup> percentile inference step time with fp32/fp16 precision. |
+| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput\_${percentile} | throughput (samples/s) | The n<sup>th</sup> percentile inference throughput with fp32/fp16 precision. |
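For the percentile rows above, a hedged sketch of how the listed percentiles could be computed from per-step latencies (a simplified nearest-rank scheme on synthetic data; the metric suffix and model name are illustrative, not SuperBench's internal implementation):

```python
def percentile(values, p):
    """Simplified nearest-rank percentile over the sorted step times."""
    s = sorted(values)
    idx = min(len(s) - 1, int(len(s) * p / 100))
    return s[idx]

# Synthetic fp16 inference step times in ms (illustrative only).
step_times_ms = [1.0 + 0.01 * i for i in range(1000)]
for p in (50, 90, 95, 99, 99.9):
    name = f'model-benchmarks/pytorch-bert-large/fp16_inference_step_time_p{p}'
    print(name, round(percentile(step_times_ms, p), 2))
```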
@@ -29,6 +29,8 @@ available tags are listed below for all stable versions.
| Tag | Description |
|-------------------|------------------------------------|
+| v0.7.0-cuda11.8 | SuperBench v0.7.0 with CUDA 11.8 |
+| v0.7.0-cuda11.1.1 | SuperBench v0.7.0 with CUDA 11.1.1 |
| v0.6.0-cuda11.1.1 | SuperBench v0.6.0 with CUDA 11.1.1 |
| v0.5.0-cuda11.1.1 | SuperBench v0.5.0 with CUDA 11.1.1 |
| v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
@@ -41,6 +43,10 @@ available tags are listed below for all stable versions.
| Tag | Description |
|-------------------------------|--------------------------------------------------|
+| v0.7.0-rocm5.1.3 | SuperBench v0.7.0 with ROCm 5.1.3 |
+| v0.7.0-rocm5.1.1 | SuperBench v0.7.0 with ROCm 5.1.1 |
+| v0.7.0-rocm5.0.1 | SuperBench v0.7.0 with ROCm 5.0.1 |
+| v0.7.0-rocm5.0 | SuperBench v0.7.0 with ROCm 5.0 |
| v0.6.0-rocm5.1.3 | SuperBench v0.6.0 with ROCm 5.1.3 |
| v0.6.0-rocm5.1.1 | SuperBench v0.6.0 with ROCm 5.1.1 |
| v0.6.0-rocm5.0.1 | SuperBench v0.6.0 with ROCm 5.0.1 |
......
@@ -65,7 +65,7 @@ superbench:
example:
```yaml
# SuperBench rules
-version: v0.6
+version: v0.7
superbench:
  rules:
    failure-rule:
......
@@ -58,7 +58,7 @@ superbench:
```yaml title="Example"
# SuperBench rules
-version: v0.6
+version: v0.7
superbench:
  rules:
    kernel_launch:
......
@@ -6,5 +6,5 @@
Provide hardware and software benchmarks for AI systems.
"""
-__version__ = '0.6.0'
+__version__ = '0.7.0'
__author__ = 'Microsoft'
@@ -4,6 +4,7 @@
#pragma once
#include <memory>
+#include <stdexcept>
#include <stdio.h>
#include <vector>
......
@@ -61,15 +61,18 @@ def __init__(self, config, num_classes):
        self._embedding = torch.nn.Embedding(config.vocab_size, config.hidden_size)
        # Build BERT using nn.TransformerEncoderLayer or te.TransformerLayer
        # input shape: (seq_len, batch_size, hidden_size)
-        encoder_layer = te.TransformerLayer(
-            config.hidden_size,
-            config.intermediate_size,
-            config.num_attention_heads,
-            apply_residual_connection_post_layernorm=True,
-            output_layernorm=True,
-            layer_type='encoder',
-        )
-        self._encoder_layers = torch.nn.ModuleList([encoder_layer for _ in range(config.num_hidden_layers)])
+        self._encoder_layers = torch.nn.ModuleList(
+            [
+                te.TransformerLayer(
+                    config.hidden_size,
+                    config.intermediate_size,
+                    config.num_attention_heads,
+                    apply_residual_connection_post_layernorm=True,
+                    output_layernorm=True,
+                    layer_type='encoder',
+                ) for _ in range(config.num_hidden_layers)
+            ]
+        )
        # BertPooler used in huggingface transformers
        # https://github.com/huggingface/transformers/blob/accad48e/src/transformers/models/bert/modeling_bert.py#L893
        self._pooler = torch.nn.Sequential(
@@ -113,7 +116,6 @@ def __init__(self, name, parameters=''):
            Precision.FLOAT16,
            Precision.FP8_HYBRID,
            Precision.FP8_E4M3,
-            Precision.FP8_E5M2,
        ]
        self._optimizer_type = Optimizer.ADAMW
        self._loss_fn = torch.nn.CrossEntropyLoss()
......
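The model fix above addresses parameter sharing: the old code built one `te.TransformerLayer` and referenced it `num_hidden_layers` times in the `ModuleList`, so every encoder layer shared the same weights, while the new code constructs a fresh layer per entry. A pure-Python sketch of the difference (the `Layer` class is a stand-in, not the real Transformer Engine module):

```python
class Layer:
    """Stand-in for te.TransformerLayer with one mutable 'weight'."""
    def __init__(self):
        self.weight = 0.0

# Buggy pattern: one instance, N references -- all "layers" are the same object.
shared = Layer()
aliased = [shared for _ in range(3)]

# Fixed pattern: the constructor runs once per entry -- independent layers.
independent = [Layer() for _ in range(3)]

aliased[0].weight = 1.0
print(aliased[2].weight)                 # 1.0: the update leaked to every entry
print(independent[0] is independent[1])  # False: distinct objects
```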
@@ -44,6 +44,7 @@ def load_arguments(self, command):
        ac.argument('docker_username', type=str, help='Docker registry username if authentication is needed.')
        ac.argument('docker_password', type=str, help='Docker registry password if authentication is needed.')
        ac.argument('no_docker', action='store_true', help='Run on host directly without Docker.')
+        ac.argument('no_image_pull', action='store_true', help='Skip pull and use local Docker image.')
        ac.argument(
            'host_file', options_list=('--host-file', '-f'), type=str, help='Path to Ansible inventory host file.'
        )
......
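The new flag follows the same `store_true` convention as `no_docker` and is later inverted into the Docker config's `pull` field. A hedged stand-alone sketch using `argparse` (SuperBench registers arguments through its own CLI framework, so this only mirrors the behavior):

```python
import argparse

parser = argparse.ArgumentParser(prog='sb-deploy-sketch')  # illustrative prog name
parser.add_argument('--no-image-pull', action='store_true',
                    help='Skip pull and use local Docker image.')

args = parser.parse_args(['--no-image-pull'])
# Mirrors the runner config entry: 'pull': not no_image_pull
docker_config = {'pull': not args.no_image_pull}
print(docker_config)
# → {'pull': False}
```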
@@ -99,6 +99,7 @@ def process_runner_arguments(
    docker_username=None,
    docker_password=None,
    no_docker=False,
+    no_image_pull=False,
    host_file=None,
    host_list=None,
    host_username=None,
@@ -115,6 +116,7 @@ def process_runner_arguments(
        docker_username (str, optional): Docker registry username if authentication is needed. Defaults to None.
        docker_password (str, optional): Docker registry password if authentication is needed. Defaults to None.
        no_docker (bool, optional): Run on host directly without Docker. Defaults to False.
+        no_image_pull (bool, optional): Skip pull and use local Docker image. Defaults to False.
        host_file (str, optional): Path to Ansible inventory host file. Defaults to None.
        host_list (str, optional): Comma separated host list. Defaults to None.
        host_username (str, optional): Host username if needed. Defaults to None.
@@ -149,6 +151,7 @@ def process_runner_arguments(
            'password': docker_password,
            'registry': split_docker_domain(docker_image)[0],
            'skip': no_docker,
+            'pull': not no_image_pull,
        }
    )
    # Ansible config
@@ -209,6 +212,7 @@ def deploy_command_handler(
    docker_image='superbench/superbench',
    docker_username=None,
    docker_password=None,
+    no_image_pull=False,
    host_file=None,
    host_list=None,
    host_username=None,
@@ -228,6 +232,7 @@ def deploy_command_handler(
        docker_image (str, optional): Docker image URI. Defaults to superbench/superbench:latest.
        docker_username (str, optional): Docker registry username if authentication is needed. Defaults to None.
        docker_password (str, optional): Docker registry password if authentication is needed. Defaults to None.
+        no_image_pull (bool, optional): Skip pull and use local Docker image. Defaults to False.
        host_file (str, optional): Path to Ansible inventory host file. Defaults to None.
        host_list (str, optional): Comma separated host list. Defaults to None.
        host_username (str, optional): Host username if needed. Defaults to None.
@@ -243,6 +248,7 @@ def deploy_command_handler(
        docker_username=docker_username,
        docker_password=docker_password,
        no_docker=False,
+        no_image_pull=no_image_pull,
        host_file=host_file,
        host_list=host_list,
        host_username=host_username,
@@ -298,6 +304,7 @@ def run_command_handler(
        docker_username=docker_username,
        docker_password=docker_password,
        no_docker=no_docker,
+        no_image_pull=False,
        host_file=host_file,
        host_list=host_list,
        host_username=host_username,
......
@@ -3,7 +3,7 @@
# Server:
# - Product: HPE Apollo 6500
-version: v0.6
+version: v0.7
superbench:
  enable: null
  var:
......
@@ -4,7 +4,7 @@
# - Product: G482-Z53
# - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
-version: v0.6
+version: v0.7
superbench:
  enable: null
  var:
......
-version: v0.6
+version: v0.7
superbench:
  enable: null
  monitor:
......
-version: v0.6
+version: v0.7
superbench:
  enable: null
  monitor:
......
-version: v0.6
+version: v0.7
superbench:
  enable: null
  monitor:
......