Unverified Commit 6681c720 authored by Yifan Xiong, committed by GitHub

Release - SuperBench v0.5.0 (#350)



**Description**

Cherry-pick bug fixes from v0.5.0 to main.

**Major Revisions**

* Bug - Force to fix ort version as '1.10.0' (#343)
* Bug - Support no matching rules and unify the output name in result_summary (#345)
* Analyzer - Support regex in annotations of benchmark naming for metrics in rules (#344)
* Bug - Fix bugs in sync results on root rank for e2e model benchmarks (#342)
* Bug - Fix bug of duration feature for model benchmarks in distributed mode (#347)
* Docs - Upgrade version and release note (#348)
Co-authored-by: Yuting Jiang <v-yutjiang@microsoft.com>
parent 712eafc3
......@@ -15,7 +15,7 @@
__SuperBench__ is a validation and profiling tool for AI infrastructure.
📢 [v0.4.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.4.0) has been released!
📢 [v0.5.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.5.0) has been released!
## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._
......
......@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
:::note Note
You should check out the corresponding tag to use a release version, for example,
`git clone -b v0.4.0 https://github.com/microsoft/superbenchmark`
`git clone -b v0.5.0 https://github.com/microsoft/superbenchmark`
:::
```bash
......
......@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
:::note Note
You should deploy the corresponding Docker image to use a release version, for example,
`sb deploy -f local.ini -i superbench/superbench:v0.4.0-cuda11.1.1`
`sb deploy -f local.ini -i superbench/superbench:v0.5.0-cuda11.1.1`
Note that the version of the git repo only determines the version of the sb CLI, not of the sb container. You should specify the container version explicitly even if you cloned a release tag.
......
......@@ -70,7 +70,7 @@ superbench:
<TabItem value='example'>
```yaml
version: v0.4
version: v0.5
superbench:
enable: benchmark_1
monitor:
......
......@@ -28,7 +28,8 @@ available tags are listed below for all stable versions.
<TabItem value='cuda'>
| Tag | Description |
| ----------------- | ---------------------------------- |
|-------------------|------------------------------------|
| v0.5.0-cuda11.1.1 | SuperBench v0.5.0 with CUDA 11.1.1 |
| v0.4.0-cuda11.1.1 | SuperBench v0.4.0 with CUDA 11.1.1 |
| v0.3.0-cuda11.1.1 | SuperBench v0.3.0 with CUDA 11.1.1 |
| v0.2.1-cuda11.1.1 | SuperBench v0.2.1 with CUDA 11.1.1 |
......@@ -38,7 +39,11 @@ available tags are listed below for all stable versions.
<TabItem value='rocm'>
| Tag | Description |
| --------------------------- | ---------------------------------------------- |
|-------------------------------|--------------------------------------------------|
| v0.5.0-rocm5.0.1-pytorch1.9.0 | SuperBench v0.5.0 with ROCm 5.0.1, PyTorch 1.9.0 |
| v0.5.0-rocm5.0-pytorch1.9.0 | SuperBench v0.5.0 with ROCm 5.0, PyTorch 1.9.0 |
| v0.5.0-rocm4.2-pytorch1.7.0 | SuperBench v0.5.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.5.0-rocm4.0-pytorch1.7.0 | SuperBench v0.5.0 with ROCm 4.0, PyTorch 1.7.0 |
| v0.4.0-rocm4.2-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.2, PyTorch 1.7.0 |
| v0.4.0-rocm4.0-pytorch1.7.0 | SuperBench v0.4.0 with ROCm 4.0, PyTorch 1.7.0 |
| v0.3.0-rocm4.2-pytorch1.7.0 | SuperBench v0.3.0 with ROCm 4.2, PyTorch 1.7.0 |
......
......@@ -65,7 +65,7 @@ superbench:
example:
```yaml
# SuperBench rules
version: v0.4
version: v0.5
superbench:
rules:
failure-rule:
......
......@@ -20,14 +20,12 @@ This tool is to generate a readable summary report based on the raw benchmark re
sb result summary --data-file ./results-summary.jsonl --rule-file ./rule.yaml --output-file-format md --output-dir ${output-dir}
```
4. Find the output result file named 'results_summary.md' under ${output_dir}.
4. Find the output result file named 'results-summary.md' under ${output_dir}.
## Input
The input includes two files:
- **Raw Data**: a JSONL file containing multiple nodes' results, generated automatically by the SuperBench runner.
:::tip Tips
......@@ -60,7 +58,7 @@ superbench:
```yaml title="Example"
# SuperBench rules
version: v0.4
version: v0.5
superbench:
rules:
kernel_launch:
......@@ -122,3 +120,8 @@ The following illustrates all statistical functions:
- `min`
- `p${value}`: ${value} can be 1-99. For example, p50, p90, etc.
- `std`
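For instance, the `p${value}` statistics are plain percentiles over all raw values of a metric; a rough numpy equivalent (illustrative only, not the tool's internal SummaryOp implementation) is:

```python
import numpy as np

values = [1.2, 1.5, 1.1, 2.0, 1.3]  # hypothetical raw results for one metric
p50 = np.percentile(values, 50)     # the 'p50' statistic
p90 = np.percentile(values, 90)     # the 'p90' statistic
```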
## Output
We support different output formats for the result summary, including markdown, html, etc.
The output lists the metrics grouped by category, with values obtained by applying the statistical functions to all raw results.
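As a rough illustration (hypothetical metric names and values, not actual tool output), each summary entry follows the `[category, metric, statistic, value]` line format used internally, so the markdown output is essentially a table over those four columns:

```python
# Illustrative sketch only; metric names and values are hypothetical.
summary_lines = [
    ['KernelLaunch', 'kernel-launch/event_overhead', 'mean', 0.0055],
    ['KernelLaunch', 'kernel-launch/wall_overhead', 'p90', 0.0100],
]
print('| category | metric | statistic | value |')
print('|----------|--------|-----------|-------|')
for line in summary_lines:
    print('| {} | {} | {} | {} |'.format(*line))
```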
......@@ -173,7 +173,7 @@ def run(self):
'nvidia': ['py3nvml>=0.2.6'],
'ort': [
'onnx>=1.10.2',
'onnxruntime-gpu>=1.9.0',
'onnxruntime-gpu==1.10.0',
],
'torch': [
'torch>=1.7.0a0',
......
......@@ -6,5 +6,5 @@
Provide hardware and software benchmarks for AI systems.
"""
__version__ = '0.4.0'
__version__ = '0.5.0'
__author__ = 'Microsoft'
......@@ -84,19 +84,23 @@ def _parse_rules(self, rules):
logger.error('ResultSummary: parse rules failed - {}'.format(str(e)))
return False
def _format_summary_of_rule(self, category, summary_df_of_rule):
def _format_summary_of_rule(self, category, summary_df_of_rule, statistics):
"""Format summary_df of a rule info list of lines.
Args:
category (str): category in the rule
summary_df_of_rule (DataFrame): summary df of a rule; the columns are metrics, the index is statistics
statistics (list): statistics in the rule
Returns:
list: list of summary lines like [category, metric, statistic, value]
"""
summary = []
metrics = summary_df_of_rule.columns
if metrics.empty is True:
for statistic in statistics:
summary.append([category, '', statistic, ''])
for metric in metrics:
for statistic in summary_df_of_rule.index:
for statistic in statistics:
summary.append([category, metric, statistic, summary_df_of_rule.loc[statistic, metric]])
return summary
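A minimal sketch of the new empty-metrics behavior (the `result_summary` instance and inputs are hypothetical): a rule that matched no metrics now still yields one placeholder line per statistic instead of an empty list:

```python
import pandas as pd

# Hypothetical call: an empty DataFrame means the rule matched no metrics.
lines = result_summary._format_summary_of_rule('KernelLaunch', pd.DataFrame(), ['mean', 'p90'])
# -> [['KernelLaunch', '', 'mean', ''], ['KernelLaunch', '', 'p90', '']]
```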
......@@ -132,6 +136,10 @@ def _generate_summary(self, round):
metrics = list(self._sb_rules[rule]['metrics'].keys())
category = self._sb_rules[rule]['categories']
data_df_of_rule = self._raw_data_df[metrics]
statistics = self._sb_rules[rule]['statistics']
summary_df_of_rule = pd.DataFrame()
# skip metrics aggregation and statistics calculation for rules with no matched metrics
if len(metrics) != 0:
if self._sb_rules[rule]['aggregate']:
# if aggregate is True, aggregate in ranks
if self._sb_rules[rule]['aggregate'] is True:
......@@ -139,7 +147,6 @@ def _generate_summary(self, round):
# if aggregate is not empty and is a pattern in regex, aggregate according to pattern
else:
data_df_of_rule = data_analysis.aggregate(data_df_of_rule, self._sb_rules[rule]['aggregate'])
statistics = self._sb_rules[rule]['statistics']
summary_df_of_rule = pd.DataFrame(columns=sorted(data_df_of_rule.columns))
for statistic_name in statistics:
# get SummaryOp and calculate statistics
......@@ -157,7 +164,7 @@ def _generate_summary(self, round):
summary_df_of_rule, round, list(summary_df_of_rule.columns)
)
# format summary_df of a rule to list of lines
summary_lines_of_rule = self._format_summary_of_rule(category, summary_df_of_rule)
summary_lines_of_rule = self._format_summary_of_rule(category, summary_df_of_rule, statistics)
summary[category] = summary_lines_of_rule
return summary
......@@ -233,15 +240,15 @@ def run(self, raw_data_file, rule_file, output_dir, output_format, round=2):
# output result summary to file
output_path = ''
if output_format == 'excel':
output_path = str(Path(output_dir) / 'results_summary.xlsx')
output_path = str(Path(output_dir) / 'results-summary.xlsx')
summary_df = self._merge_summary(summary)
self.output_summary_in_excel(self._raw_data_df, summary_df, output_path)
elif output_format == 'md':
output_path = str(Path(output_dir) / 'results_summary.md')
output_path = str(Path(output_dir) / 'results-summary.md')
lines = self.generate_md_lines(summary)
file_handler.output_lines_in_md(lines, output_path)
elif output_format == 'html':
output_path = str(Path(output_dir) / 'results_summary.html')
output_path = str(Path(output_dir) / 'results-summary.html')
lines = self.generate_md_lines(summary)
file_handler.output_lines_in_html(lines, output_path)
else:
......
......@@ -32,6 +32,9 @@ def _get_metrics_by_benchmarks(self, metrics_list):
logger.warning('RuleBase: get_metrics_by_benchmarks - {} does not have benchmark_name'.format(metric))
else:
benchmark = metric.split('/')[0]
# support annotations in benchmark naming
if ':' in benchmark:
benchmark = metric.split(':')[0]
if benchmark not in benchmarks_metrics:
benchmarks_metrics[benchmark] = set()
benchmarks_metrics[benchmark].add(metric)
......
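To make the annotation handling above concrete, a small sketch (the metric name is hypothetical; the parsing mirrors the hunk above):

```python
# A metric with an annotated benchmark name of the form 'benchmark:annotation/...'.
metric = 'model-benchmarks:resnet/pytorch-resnet50/fp32_train_step_time'
benchmark = metric.split('/')[0]      # 'model-benchmarks:resnet'
if ':' in benchmark:
    # strip the annotation so metrics group under the plain benchmark name
    benchmark = metric.split(':')[0]  # 'model-benchmarks'
```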
......@@ -35,6 +35,7 @@ def __init__(self, name, parameters=''):
self._benchmark_type = BenchmarkType.MODEL
self._world_size = 1
self._local_rank = None
self._global_rank = None
self._dataset = None
self._dataloader = None
self._model = None
......@@ -242,7 +243,8 @@ def __train(self, precision):
# The unit of step time should be millisecond.
step_times = self._train_step(precision)
if not self.__process_model_result(ModelAction.TRAIN, precision, step_times):
step_times = self.__process_model_result(ModelAction.TRAIN, precision, step_times)
if not step_times:
self._result.set_return_code(ReturnCode.INVALID_BENCHMARK_RESULT)
return False
......@@ -266,7 +268,8 @@ def __inference(self, precision):
self._create_model(precision)
# The unit of step time should be millisecond.
step_times = self._inference_step(precision)
if not self.__process_model_result(ModelAction.INFERENCE, precision, step_times):
step_times = self.__process_model_result(ModelAction.INFERENCE, precision, step_times)
if not step_times:
self._result.set_return_code(ReturnCode.INVALID_BENCHMARK_RESULT)
return False
......@@ -369,9 +372,9 @@ def _sync_result(self, result):
result (list): The result data to sync.
Return:
True if reduce result data successfully.
The result data if reduced successfully, otherwise None.
"""
return True
return result
def __process_model_result(self, model_action, precision, step_times):
"""Function to process raw results and save the summarized results.
......@@ -382,7 +385,7 @@ def __process_model_result(self, model_action, precision, step_times):
step_times (list): The step time list of every training/inference step, unit is millisecond.
Return:
True if step_times list is not empty.
The step_times list if it is not empty, otherwise None.
"""
if len(step_times) == 0:
logger.error(
......@@ -390,7 +393,7 @@ def __process_model_result(self, model_action, precision, step_times):
self._curr_run_index, self._name, model_action, precision
)
)
return False
return None
precision_metric = {'float16': 'fp16', 'float32': 'fp32', 'float64': 'fp64', 'bfloat16': 'bf16'}
if precision.value in precision_metric.keys():
......@@ -404,9 +407,10 @@ def __process_model_result(self, model_action, precision, step_times):
self._result.add_raw_data(metric_t, throughput, self._args.log_raw_data)
if model_action == ModelAction.TRAIN:
if not self._sync_result(step_times):
return False
if self._local_rank is None or self._local_rank == 0:
step_times = self._sync_result(step_times)
if not step_times:
return None
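# Record results on the global rank 0 only (or the single process when local_rank
# is None), so multi-node runs no longer duplicate them on every node's local rank 0.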
if self._local_rank is None or self._global_rank == 0:
self._result.add_result(metric_s, statistics.mean(step_times))
throughput = [millisecond_per_second / step_time * self._args.batch_size for step_time in step_times]
self._result.add_result(metric_t, statistics.mean(throughput))
......@@ -416,7 +420,7 @@ def __process_model_result(self, model_action, precision, step_times):
self._process_percentile_result(metric_s, step_times)
self._process_percentile_result(metric_t, throughput)
return True
return step_times
@abstractmethod
def _cal_params_count(self):
......
......@@ -5,6 +5,7 @@
import os
from datetime import timedelta
import time
import torch
import transformers
......@@ -60,6 +61,7 @@ def _init_distributed_setting(self):
hvd.init()
self._world_size = int(hvd.size())
self._local_rank = int(hvd.local_rank())
self._global_rank = int(hvd.rank())
elif self._args.distributed_impl == DistributedImpl.DDP:
if os.environ.get('WORLD_SIZE') is None or os.environ.get('LOCAL_RANK') is None:
logger.error(
......@@ -70,17 +72,17 @@ def _init_distributed_setting(self):
# torch >= 1.9.0a0 torch.distributed.elastic is used by default
port = int(os.environ['MASTER_PORT']) + 1
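# MASTER_PORT is already used by the elastic agent (torch >= 1.9), so bind this TCPStore to port + 1 to avoid a conflict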
addr = os.environ['MASTER_ADDR']
global_rank = int(os.environ['RANK'])
self._global_rank = int(os.environ['RANK'])
self._local_rank = int(os.environ['LOCAL_RANK'])
self._world_size = int(os.environ['WORLD_SIZE'])
logger.debug('ip:{},port:{},rank:{},world:{}'.format(addr, port, global_rank, self._world_size))
logger.debug('ip:{},port:{},rank:{},world:{}'.format(addr, port, self._global_rank, self._world_size))
store = PrefixStore(
self._name, TCPStore(addr, port, self._world_size, global_rank == 0, timedelta(seconds=300))
self._name, TCPStore(addr, port, self._world_size, self._global_rank == 0, timedelta(seconds=300))
)
torch.distributed.init_process_group(
backend=self._args.distributed_backend.value,
timeout=timedelta(seconds=300),
rank=global_rank,
rank=self._global_rank,
world_size=self._world_size,
store=store
)
......@@ -188,6 +190,33 @@ def _create_optimizer(self):
return True
def _is_finished(self, curr_step, curr_time, check_frequency=100):
"""Judge whether the benchmarking should be stopped early or not.
Args:
curr_step (int): the current benchmarking step.
curr_time (float): the current time in seconds, obtained from time.time().
check_frequency (int): how often, in steps, to check whether the benchmark should stop.
Return:
True if the benchmarking should be stopped.
"""
is_finished = int(super()._is_finished(curr_step, curr_time))
if self._args.duration > 0:
if curr_step % check_frequency == 0:
# sync is_finished in distributed mode
# if any rank is_finished is True, all ranks should be finished
if self._args.distributed_impl == DistributedImpl.DDP:
tensor = torch.IntTensor([is_finished])
if self._args.distributed_backend == DistributedBackend.NCCL:
tensor = tensor.cuda()
torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.MAX)
is_finished = tensor.tolist()[0]
else:
is_finished = 0
return (is_finished == 1)
def _sync_result(self, result):
"""Function to reduce the result to rank 0.
......@@ -195,10 +224,11 @@ def _sync_result(self, result):
result (list): The result data to sync.
Return:
True if reduce result data successfully.
The result data if reduced successfully, otherwise None.
"""
if not super()._sync_result(result):
return False
result = super()._sync_result(result)
if not result:
return None
try:
if self._args.distributed_impl == DistributedImpl.DDP:
......@@ -206,7 +236,7 @@ def _sync_result(self, result):
tensor = torch.as_tensor(result).cuda()
else:
tensor = torch.as_tensor(result)
torch.distributed.reduce(tensor, 0, op=torch.distributed.ReduceOp.MAX)
torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.MAX)
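# all_reduce (rather than reduce to rank 0) leaves the reduced step times on every
# rank, so each rank can validate the synced result before global rank 0 records it.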
result = tensor.tolist()
except BaseException as e:
logger.error(
......@@ -214,9 +244,9 @@ def _sync_result(self, result):
self._name, self._args.distributed_impl, str(e)
)
)
return False
return None
return True
return result
def _postprocess(self):
"""Postprocess/cleanup operations after the benchmarking.
......@@ -257,3 +287,16 @@ def _cal_params_count(self):
The count of trainable parameters.
"""
return sum(p.numel() for p in self._model.parameters() if p.requires_grad)
def _timer(self):
"""Returns the current time which ensures all previous CUDA events have been finished.
If there is no GPU present, this defaults to `time.time()`; otherwise it will
synchronize CUDA before measuring the time.
Returns:
Current time in seconds.
"""
if self._gpu_available:
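# CUDA kernel launches are asynchronous; synchronize so the timestamp
# reflects finished GPU work rather than just queued kernels.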
torch.cuda.synchronize()
return time.time()
......@@ -3,8 +3,6 @@
"""Module of the Pytorch BERT model."""
import time
import torch
from transformers import BertModel, BertConfig
......@@ -137,9 +135,10 @@ def _train_step(self, precision):
"""
duration = []
curr_step = 0
check_frequency = 100
while True:
for idx, sample in enumerate(self._dataloader):
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._optimizer.zero_grad()
......@@ -147,12 +146,12 @@ def _train_step(self, precision):
loss = self._loss_fn(output, self._target)
loss.backward()
self._optimizer.step()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
duration.append((end - start) * 1000)
if self._is_finished(curr_step, end):
if self._is_finished(curr_step, end, check_frequency):
return duration
def _inference_step(self, precision):
......@@ -171,13 +170,11 @@ def _inference_step(self, precision):
self._model.eval()
while True:
for idx, sample in enumerate(self._dataloader):
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._model(sample)
if self._gpu_available:
torch.cuda.synchronize()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
......
......@@ -3,8 +3,6 @@
"""Module of the Pytorch CNN models."""
import time
import torch
from torchvision import models
......@@ -99,10 +97,11 @@ def _train_step(self, precision):
"""
duration = []
curr_step = 0
check_frequency = 100
while True:
for idx, sample in enumerate(self._dataloader):
sample = sample.to(dtype=getattr(torch, precision.value))
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._optimizer.zero_grad()
......@@ -110,12 +109,12 @@ def _train_step(self, precision):
loss = self._loss_fn(output, self._target)
loss.backward()
self._optimizer.step()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
duration.append((end - start) * 1000)
if self._is_finished(curr_step, end):
if self._is_finished(curr_step, end, check_frequency):
return duration
def _inference_step(self, precision):
......@@ -135,13 +134,11 @@ def _inference_step(self, precision):
while True:
for idx, sample in enumerate(self._dataloader):
sample = sample.to(dtype=getattr(torch, precision.value))
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._model(sample)
if self._gpu_available:
torch.cuda.synchronize()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
......
......@@ -3,8 +3,6 @@
"""Module of the Pytorch GPT2 model."""
import time
import torch
from transformers import GPT2Model, GPT2Config
......@@ -131,9 +129,10 @@ def _train_step(self, precision):
"""
duration = []
curr_step = 0
check_frequency = 100
while True:
for idx, sample in enumerate(self._dataloader):
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._optimizer.zero_grad()
......@@ -141,12 +140,12 @@ def _train_step(self, precision):
loss = self._loss_fn(output[range(self._args.batch_size), -1], self._target)
loss.backward()
self._optimizer.step()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
duration.append((end - start) * 1000)
if self._is_finished(curr_step, end):
if self._is_finished(curr_step, end, check_frequency):
return duration
def _inference_step(self, precision):
......@@ -165,13 +164,11 @@ def _inference_step(self, precision):
self._model.eval()
while True:
for idx, sample in enumerate(self._dataloader):
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._model(sample)
if self._gpu_available:
torch.cuda.synchronize()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
......
......@@ -3,8 +3,6 @@
"""Module of the Pytorch LSTM model."""
import time
import torch
from superbench.common.utils import logger
......@@ -139,10 +137,11 @@ def _train_step(self, precision):
"""
duration = []
curr_step = 0
check_frequency = 100
while True:
for idx, sample in enumerate(self._dataloader):
sample = sample.to(dtype=getattr(torch, precision.value))
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._optimizer.zero_grad()
......@@ -150,12 +149,12 @@ def _train_step(self, precision):
loss = self._loss_fn(output, self._target)
loss.backward()
self._optimizer.step()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
duration.append((end - start) * 1000)
if self._is_finished(curr_step, end):
if self._is_finished(curr_step, end, check_frequency):
return duration
def _inference_step(self, precision):
......@@ -175,13 +174,11 @@ def _inference_step(self, precision):
while True:
for idx, sample in enumerate(self._dataloader):
sample = sample.to(dtype=getattr(torch, precision.value))
start = time.time()
start = self._timer()
if self._gpu_available:
sample = sample.cuda()
self._model(sample)
if self._gpu_available:
torch.cuda.synchronize()
end = time.time()
end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.
......
......@@ -3,7 +3,7 @@
# Server:
# - Product: HPE Apollo 6500
version: v0.4
version: v0.5
superbench:
enable: null
var:
......
......@@ -4,7 +4,7 @@
# - Product: G482-Z53
# - Link: https://www.gigabyte.cn/FileUpload/Global/MicroSite/553/G482-Z53.html
version: v0.4
version: v0.5
superbench:
enable: null
var:
......
version: v0.4
version: v0.5
superbench:
enable: null
monitor:
......