Unverified Commit 6681c720 authored by Yifan Xiong, committed by GitHub

Release - SuperBench v0.5.0 (#350)



**Description**

Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
parent 712eafc3
......@@ -15,7 +15,7 @@
__SuperBench__ is a validation and profiling tool for AI infrastructure.
📢 [v0.4.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.4.0) has been released!
📢 [v0.5.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.5.0) has been released!
## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._
......
......@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
:::note Note
You should check out the corresponding tag to use a release version, for example,
`git clone -b v0.4.0 https://github.com/microsoft/superbenchmark`
`git clone -b v0.5.0 https://github.com/microsoft/superbenchmark`
:::
```bash
......
......@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
:::note Note
You should deploy the corresponding Docker image to use a release version, for example,
`sb deploy -f local.ini -i superbench/superbench:v0.4.0-cuda11.1.1`
`sb deploy -f local.ini -i superbench/superbench:v0.5.0-cuda11.1.1`
Note that the version of the git repo only determines the version of the sb CLI, not of the sb container. You should specify the container version explicitly even if you cloned a release tag.
......
......@@ -70,7 +70,7 @@ superbench:
<TabItem value='example'>
```yaml
version: v0.4
version: v0.5
superbench:
enable: benchmark_1
monitor:
......
......@@ -28,7 +28,8 @@ available tags are listed below for all stable versions.
<TabItem value='cuda'>
| Tag | Description |
| ----------------- | ---------------------------------- |
|-------------------|------------------------------------|
| v0.5.0-cuda11.1.1 | SuperBench v0.5.0 with CUDA 11.1.1 |
| v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
| v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
| v0.2.1-cuda11.1.1 | SuperBench v0.2.1 with CUDA 11.1.1 |
......@@ -38,7 +39,11 @@ available tags are listed below for all stable versions.
<TabItem value='rocm'>
| Tag | Description |
| --------------------------- | ---------------------------------------------- |
|-------------------------------|--------------------------------------------------|
| v0.5.0-rocm5.0.1-pytorch1.9.0 | SuperBench v0.5.0 with ROCm 5.0.1, PyTorch 1.9.0 |
| v0.5.0-rocm5.0-pytorch1.9.0 | SuperBench v0.5.0 with ROCm 5.0, PyTorch 1.9.0 |
| v0.5.0-rocm4.2-pytorch1.7.0 | SuperBench v0.5.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.5.0-rocm4.0-pytorch1.7.0 | SuperBench v0.5.0 with ROCm 4.0, PyTorch 1.7.0 |
| v0.4.0-rocm4.2-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.4.0-rocm4.0-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.0, PyTorch 1.7.0 |
| v0.3.0-rocm4.2-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.2, PyTorch 1.7.0 |
......
......@@ -65,7 +65,7 @@ superbench:
example:
```yaml
# SuperBench rules
version: v0.4
version: v0.5
superbench:
rules:
failure-rule:
......
......@@ -20,14 +20,12 @@ This tool is to generate a readable summary report based on the raw benchmark re
sb result summary --data-file ./results-summary.jsonl --rule-file ./rule.yaml --output-file-format md --output-dir ${output-dir}
```
4. Find the output result file named 'results_summary.md' under ${output_dir}.
4. Find the output result file named 'results-summary.md' under ${output_dir}.
## Input
The input includes two files:
- **Raw Data**: a JSONL file containing multiple nodes' results, generated automatically by the SuperBench runner.
:::tip Tips
......@@ -60,7 +58,7 @@ superbench:
```yaml title="Example"
# SuperBench rules
version: v0.4
version: v0.5
superbench:
rules:
kernel_launch:
......@@ -122,3 +120,8 @@ The following illustrates all statistical functions:
- `min`
- `p${value}`: ${value} can be 1-99. For example, p50, p90, etc.
- `std`
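For instance, the `p${value}` statistics are plain percentiles over all raw values of a metric; a rough numpy equivalent (illustrative only, not the tool's internal SummaryOp implementation) is:

```python
import numpy as np

values = [1.2, 1.5, 1.1, 2.0, 1.3]  # hypothetical raw results for one metric
p50 = np.percentile(values, 50)     # the 'p50' statistic
p90 = np.percentile(values, 90)     # the 'p90' statistic
```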
## Output
We support different output formats for the result summary, including markdown, html, etc.
The output lists the metrics grouped by category, with values obtained by applying the statistical functions to all raw results.
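As a rough illustration (hypothetical metric names and values, not actual tool output), each summary entry follows the `[category, metric, statistic, value]` line format used internally, so the markdown output is essentially a table over those four columns:

```python
# Illustrative sketch only; metric names and values are hypothetical.
summary_lines = [
    ['KernelLaunch', 'kernel-launch/event_overhead', 'mean', 0.0055],
    ['KernelLaunch', 'kernel-launch/wall_overhead', 'p90', 0.0100],
]
print('| category | metric | statistic | value |')
print('|----------|--------|-----------|-------|')
for line in summary_lines:
    print('| {} | {} | {} | {} |'.format(*line))
```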
......@@ -173,7 +173,7 @@ def run(self):
'nvidia': ['py3nvml>=0.2.6'],
'ort': [
'onnx>=1.10.2',
'onnxruntime-gpu>=1.9.0',
'onnxruntime-gpu==1.10.0',
],
'torch': [
'torch>=1.7.0a0',
......
......@@ -6,5 +6,5 @@
Provide hardware and software benchmarks for AI systems.
"""
__version__ = '0.4.0'
__version__ = '0.5.0'
__author__ = 'Microsoft'
......@@ -84,19 +84,23 @@ def _parse_rules(self, rules):
logger.error('ResultSummary: parse rules failed - {}'.format(str(e)))
return False
def _format_summary_of_rule(self, category, summary_df_of_rule):
def _format_summary_of_rule(self, category, summary_df_of_rule, statistics):
"""Format summary_df of a rule info list of lines.
Args:
category (str): category in the rule
summary_df_of_rule (DataFrame): summary df of a rule; the columns are metrics, the index is statistics
statistics (list): statistics in the rule
Returns:
list: list of summary lines like [category, metric, statistic, value]
"""
summary = []
metrics = summary_df_of_rule.columns
if metrics.empty is True:
for statistic in statistics:
summary.append([category, '', statistic, ''])
for metric in metrics:
for statistic in summary_df_of_rule.index:
for statistic in statistics:
summary.append([category, metric, statistic, summary_df_of_rule.loc[statistic, metric]])
return summary
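A minimal sketch of the new empty-metrics behavior (the `result_summary` instance and inputs are hypothetical): a rule that matched no metrics now still yields one placeholder line per statistic instead of an empty list:

```python
import pandas as pd

# Hypothetical call: an empty DataFrame means the rule matched no metrics.
lines = result_summary._format_summary_of_rule('KernelLaunch', pd.DataFrame(), ['mean', 'p90'])
# -> [['KernelLaunch', '', 'mean', ''], ['KernelLaunch', '', 'p90', '']]
```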
......@@ -132,6 +136,10 @@ def _generate_summary(self, round):
metrics = list(self._sb_rules[rule]['metrics'].keys())
category = self._sb_rules[rule]['categories']
data_df_of_rule = self._raw_data_df[metrics]
statistics = self._sb_rules[rule]['statistics']
summary_df_of_rule = pd.DataFrame()
# skip metrics aggregation and statistics calculation for rules with no matched metrics
if len(metrics) != 0:
if self._sb_rules[rule]['aggregate']:
# if aggregate is True, aggregate in ranks
if self._sb_rules[rule]['aggregate'] is True:
......@@ -139,7 +147,6 @@ def _generate_summary(self, round):
# if aggregate is not empty and is a pattern in regex, aggregate according to pattern
else:
data_df_of_rule = data_analysis.aggregate(data_df_of_rule, self._sb_rules[rule]['aggregate'])
statistics = self._sb_rules[rule]['statistics']
summary_df_of_rule = pd.DataFrame(columns=sorted(data_df_of_rule.columns))
for statistic_name in statistics:
# get SummaryOp and calculate statistics
......@@ -157,7 +164,7 @@ def _generate_summary(self, round):
summary_df_of_rule, round, list(summary_df_of_rule.columns)
)
# format summary_df of a rule to list of lines
summary_lines_of_rule = self._format_summary_of_rule(category, summary_df_of_rule)
summary_lines_of_rule = self._format_summary_of_rule(category, summary_df_of_rule, statistics)
summary[category] = summary_lines_of_rule
return summary
......@@ -233,15 +240,15 @@ def run(self, raw_data_file, rule_file, output_dir, output_format, round=2):
# output result summary to file
output_path = ''
if output_format == 'excel':
output_path = str(Path(output_dir) / 'results_summary.xlsx')
output_path = str(Path(output_dir) / 'results-summary.xlsx')
summary_df = self._merge_summary(summary)
self.output_summary_in_excel(self._raw_data_df, summary_df, output_path)
elif output_format == 'md':
output_path = str(Path(output_dir) / 'results_summary.md')
output_path = str(Path(output_dir) / 'results-summary.md')
lines = self.generate_md_lines(summary)
file_handler.output_lines_in_md(lines, output_path)
elif output_format == 'html':
output_path = str(Path(output_dir) / 'results_summary.html')
output_path = str(Path(output_dir) / 'results-summary.html')
lines = self.generate_md_lines(summary)
file_handler.output_lines_in_html(lines, output_path)
else:
......
......@@ -32,6 +32,9 @@ def _get_metrics_by_benchmarks(self, metrics_list):
logger.warning('RuleBase: get_metrics_by_benchmarks - {} does not have benchmark_name'.format(metric))
else:
benchmark = metric.split('/')[0]
# support annotations in benchmark naming
if ':' in benchmark:
benchmark = metric.split(':')[0]
if benchmark not in benchmarks_metrics:
benchmarks_metrics[benchmark] = set()
benchmarks_metrics[benchmark].add(metric)
......
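To make the annotation handling above concrete, a small sketch (the metric name is hypothetical; the parsing mirrors the hunk above):

```python
# A metric with an annotated benchmark name of the form 'benchmark:annotation/...'.
metric = 'model-benchmarks:resnet/pytorch-resnet50/fp32_train_step_time'
benchmark = metric.split('/')[0]      # 'model-benchmarks:resnet'
if ':' in benchmark:
    # strip the annotation so metrics group under the plain benchmark name
    benchmark = metric.split(':')[0]  # 'model-benchmarks'
```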
......@@ -35,6 +35,7 @@ def __init__(self, name, parameters=''):
self._benchmark_type = BenchmarkType.MODEL
self._world_size = 1
self._local_rank = None
self._global_rank = None
self._dataset = None
self._dataloader = None
self._model = None
......@@ -242,7 +243,8 @@ def __train(self, precision):
# The unit of step time should be millisecond.
step_times = self._train_step(precision)
if not self.__process_model_result(ModelAction.TRAIN, precision, step_times):
step_times = self.__process_model_result(ModelAction.TRAIN, precision, step_times)
if not step_times:
self._result.set_return_code(ReturnCode.INVALID_BENCHMARK_RESULT)
return False
......@@ -266,7 +268,8 @@ def __inference(self, precision):
self._create_model(precision)
# The unit of step time should be millisecond.
step_times = self._inference_step(precision)
if not self.__process_model_result(ModelAction.INFERENCE, precision, step_times):
step_times = self.__process_model_result(ModelAction.INFERENCE, precision, step_times)
if not step_times:
self._result.set_return_code(ReturnCode.INVALID_BENCHMARK_RESULT)
return False
......@@ -369,9 +372,9 @@ def _sync_result(self, result):
result (list): The result data to sync.
Return:
True if reduce result data successfully.
The result data if reduced successfully, otherwise None.
"""
return True
return result
def __process_model_result(self, model_action, precision, step_times):
"""Function to process raw results and save the summarized results.
......@@ -382,7 +385,7 @@ def __process_model_result(self, model_action, precision, step_times):
step_times (list): The step time list of every training/inference step, unit is millisecond.
Return:
True if step_times list is not empty.
The step_times list if it is not empty, otherwise None.
"""
if len(step_times) == 0:
logger.error(
......@@ -390,7 +393,7 @@ def __process_model_result(self, model_action, precision, step_times):
self._curr_run_index, self._name, model_action, precision
)
)
return False
return None
precision_metric = {'float16': 'fp16', 'float32': 'fp32', 'float64': 'fp64', 'bfloat16': 'bf16'}
if precision.value in precision_metric.keys():
......@@ -404,9 +407,10 @@ def __process_model_result(self, model_action, precision, step_times):
self._result.add_raw_data(metric_t, throughput, self._args.log_raw_data)
if model_action == ModelAction.TRAIN:
if not self._sync_result(step_times):
return False
if self._local_rank is None or self._local_rank == 0:
step_times = self._sync_result(step_times)
if not step_times:
return None
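# Record results on the global rank 0 only (or the single process when local_rank
# is None), so multi-node runs no longer duplicate them on every node's local rank 0.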
if self._local_rank is None or self._global_rank == 0:
self._result.add_result(metric_s, statistics.mean(step_times))
throughput = [millisecond_per_second / step_time * self._args.batch_size for step_time in step_times]
self._result.add_result(metric_t, statistics.mean(throughput))
......@@ -416,7 +420,7 @@ def __process_model_result(self, model_action, precision, step_times):
self._process_percentile_result(metric_s, step_times)
self._process_percentile_result(metric_t, throughput)
return True
return step_times
@abstractmethod
def _cal_params_count(self):
......
......@@ -5,6 +5,7 @@
import os
from datetime import timedelta
import time
import torch
import transformers
......@@ -60,6 +61,7 @@ def _init_distributed_setting(self):
hvd.init()
self._world_size = int(hvd.size())
self._local_rank = int(hvd.local_rank())
self._global_rank = int(hvd.rank())
elif self._args.distributed_impl == DistributedImpl.DDP:
if os.environ.get('WORLD_SIZE') is None or os.environ.get('LOCAL_RANK') is None:
logger.error(
......@@ -70,17 +72,17 @@ def _init_distributed_setting(self):
# torch >= 1.9.0a0 torch.distributed.elastic is used by default
port = int(os.environ['MASTER_PORT']) + 1
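# MASTER_PORT is already used by the elastic agent (torch >= 1.9), so bind this TCPStore to port + 1 to avoid a conflict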
addr = os.environ['MASTER_ADDR']
global_rank = int(os.environ['RANK'])
self._global_rank = int(os.environ['RANK'])
self._local_rank = int(os.environ['LOCAL_RANK'])
self._world_size = int(os.environ['WORLD_SIZE'])
logger.debug('ip:{},port:{},rank:{},world:{}'.format(addr, port, global_rank, self._world_size))
logger.debug('ip:{},port:{},rank:{},world:{}'.format(addr, port, self._global_rank, self._world_size))
store = PrefixStore(
self._name, TCPStore(addr, port, self._world_size, global_rank == 0, timedelta(seconds=300))
self._name, TCPStore(addr, port, self._world_size, self._global_rank == 0, timedelta(seconds=300))
)
torch.distributed.init_process_group(
backend=self._args.distributed_backend.value,
timeout=timedelta(seconds=300),
rank=global_rank,
rank=self._global_rank,
world_size=self._world_size,
store=store
)
......@@ -188,6 +190,33 @@ def _create_optimizer(self):
return True
def _is_finished(self, curr_step, curr_time, check_frequency=100):
"""Judge whether the benchmarking should be stopped early or not.
Args:
curr_step (int): the current benchmarking step.
curr_time (float): the current time in seconds, obtained from time.time().
check_frequency (int): how often, in steps, to check whether the benchmark should stop.
Return:
True if the benchmarking should be stopped.
"""
is_finished = int(super()._is_finished(curr_step, curr_time))
if self._args.duration > 0:
if curr_step % check_frequency == 0:
# sync is_finished in distributed mode
# if any rank is_finished is True, all ranks should be finished
if self._args.distributed_impl == DistributedImpl.DDP:
tensor = torch.IntTensor([is_finished])
if self._args.distributed_backend == DistributedBackend.NCCL:
tensor = tensor.cuda()
torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.MAX)
is_finished = tensor.tolist()[0]
else:
is_finished = 0
return (is_finished == 1)
def _sync_result(self, result):
"""Function to reduce the result to rank 0.
......@@ -195,10 +224,11 @@ def _sync_result(self, result):
result (list): The result data to sync.
Return:
True if reduce result data successfully.
The result data if reduced successfully, otherwise None.
"""
if not super()._sync_result(result):
return False
result = super()._sync_result(result)
if not result:
return None
try:
if self._args.distributed_impl == DistributedImpl.DDP:
......@@ -206,7 +236,7 @@ def _sync_result(self, result):
tensor = torch.as_tensor(result).cuda()
else:
tensor = torch.as_tensor(result)
torch.distributed.reduce(tensor, 0, op=torch.distributed.ReduceOp.MAX)
torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.MAX)
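# all_reduce (rather than reduce to rank 0) leaves the reduced step times on every
# rank, so each rank can validate the synced result before global rank 0 records it.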
result = tensor.tolist()
except BaseException as e:
logger.error(
......@@ -214,9 +244,9 @@ def _sync_result(self, result):
self._name, self._args.distributed_impl, str(e)
)
)
return False
return None
return True
return result
def _postprocess(self):
"""Postprocess/cleanup operations after the benchmarking.
......@@ -257,3 +287,16 @@ def _cal_params_count(self):
The count of trainable parameters.
"""
return sum(p.numel() for p in self._model.parameters() if p.requires_grad)
def _timer(self):
"""Returns the current time which ensures all previous CUDA events have been finished.
If there is no GPU present, this defaults to `time.time()`; otherwise it will
synchronize CUDA before measuring the time.
Returns:
Current time in seconds.
"""
if self._gpu_available:
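# CUDA kernel launches are asynchronous; synchronize so the timestamp
# reflects finished GPU work rather than just queued kernels.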
torch.cuda.synchronize()
return time.time()
......@@ -3,8 +3,6 @@
"""Module of the Pytorch BERT model."""
import time
import torch
from transformers import BertModel, BertConfig
......@@ -137,9 +135,10 @@ def _train_step(self, precision):
"""
duration = []
curr_step = 0
check_frequency = 100
while True:
for idx, sample in enumerate(self._dataloader):
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._optimizer.zero_grad()
......@@ -147,12 +146,12 @@ def _train_step(self, precision):
loss = self._loss_fn(output, self._target)
loss.backward()
self._optimizer.step()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
duration.append((end - start) * 1000)
if self._is_finished(curr_step, end):
if self._is_finished(curr_step, end, check_frequency):
return duration
def _inference_step(self, precision):
......@@ -171,13 +170,11 @@ def _inference_step(self, precision):
self._model.eval()
while True:
for idx, sample in enumerate(self._dataloader):
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._model(sample)
if self._gpu_available:
torch.cuda.synchronize()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
......
......@@ -3,8 +3,6 @@
"""Module of the Pytorch CNN models."""
import time
import torch
from torchvision import models
......@@ -99,10 +97,11 @@ def _train_step(self, precision):
"""
duration = []
curr_step = 0
check_frequency = 100
while True:
for idx, sample in enumerate(self._dataloader):
sample = sample.to(dtype=getattr(torch, precision.value))
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._optimizer.zero_grad()
......@@ -110,12 +109,12 @@ def _train_step(self, precision):
loss = self._loss_fn(output, self._target)
loss.backward()
self._optimizer.step()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
duration.append((end - start) * 1000)
if self._is_finished(curr_step, end):
if self._is_finished(curr_step, end, check_frequency):
return duration
def _inference_step(self, precision):
......@@ -135,13 +134,11 @@ def _inference_step(self, precision):
while True:
for idx, sample in enumerate(self._dataloader):
sample = sample.to(dtype=getattr(torch, precision.value))
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._model(sample)
if self._gpu_available:
torch.cuda.synchronize()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
......
......@@ -3,8 +3,6 @@
"""Module of the Pytorch GPT2 model."""
import time
import torch
from transformers import GPT2Model, GPT2Config
......@@ -131,9 +129,10 @@ def _train_step(self, precision):
"""
duration = []
curr_step = 0
check_frequency = 100
while True:
for idx, sample in enumerate(self._dataloader):
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._optimizer.zero_grad()
......@@ -141,12 +140,12 @@ def _train_step(self, precision):
loss = self._loss_fn(output[range(self._args.batch_size), -1], self._target)
loss.backward()
self._optimizer.step()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
duration.append((end - start) * 1000)
if self._is_finished(curr_step, end):
if self._is_finished(curr_step, end, check_frequency):
return duration
def _inference_step(self, precision):
......@@ -165,13 +164,11 @@ def _inference_step(self, precision):
self._model.eval()
while True:
for idx, sample in enumerate(self._dataloader):
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._model(sample)
if self._gpu_available:
torch.cuda.synchronize()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
......
......@@ -3,8 +3,6 @@
"""Module of the Pytorch LSTM model."""
import time
import torch
from superbench.common.utils import logger
......@@ -139,10 +137,11 @@ def _train_step(self, precision):
"""
duration = []
curr_step = 0
check_frequency = 100
while True:
for idx, sample in enumerate(self._dataloader):
sample = sample.to(dtype=getattr(torch, precision.value))
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._optimizer.zero_grad()
......@@ -150,12 +149,12 @@ def _train_step(self, precision):
loss = self._loss_fn(output, self._target)
loss.backward()
self._optimizer.step()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
duration.append((end - start) * 1000)
if self._is_finished(curr_step, end):
if self._is_finished(curr_step, end, check_frequency):
return duration
def _inference_step(self, precision):
......@@ -175,13 +174,11 @@ def _inference_step(self, precision):
while True:
for idx, sample in enumerate(self._dataloader):
sample = sample.to(dtype=getattr(torch, precision.value))
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._model(sample)
if self._gpu_available:
torch.cuda.synchronize()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
......
......@@ -3,7 +3,7 @@
# Server:
# - Product: HPE Apollo 6500
version: v0.4
version: v0.5
superbench:
enable: null
var:
......
......@@ -4,7 +4,7 @@
# - Product: G482-Z53
# - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
version: v0.4
version: v0.5
superbench:
enable: null
var:
......
version: v0.4
version: v0.5
superbench:
enable: null
monitor:
......