Unverified Commit fd2bc9e0 authored by guoshzhao's avatar guoshzhao Committed by GitHub

Benchmarks: Add Feature - Add percentile metrics for ort and pytorch inference benchmarks (#283)

**Description**
Add 50th, 90th, 95th, 99th, and 99.9th percentile latency metrics for ORT and PyTorch inference benchmarks.
parent f7ffc545
@@ -133,11 +133,15 @@ Inference performance of the torchvision models using ONNXRuntime. Currently the
> resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0,
> squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
The supported percentiles are 50, 90, 95, 99, and 99.9.
#### Metrics

| Name                                                | Unit      | Description                                                              |
|-----------------------------------------------------|-----------|--------------------------------------------------------------------------|
| ort-inference/{precision}_{model}_time              | time (ms) | The mean latency to execute one batch of inference.                      |
| ort-inference/{precision}_{model}_time_{percentile} | time (ms) | The {percentile}th percentile latency to execute one batch of inference. |
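For illustration only (`fp16` and `resnet50` below are assumed example values, not fixed names), the percentile suffix is simply appended to the base metric name:

```python
# Illustrative expansion of the metric-name pattern from the table above.
precision, model = 'fp16', 'resnet50'  # assumed example values
base = 'ort-inference/{}_{}_time'.format(precision, model)
names = [base] + ['{}_{}'.format(base, p) for p in ('50', '90', '95', '99', '99.9')]
print(names[-1])  # ort-inference/fp16_resnet50_time_99.9
```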
## Communication Benchmarks

...
@@ -12,57 +12,60 @@ id: model-benchmarks
Run training or inference tasks with single or half precision for GPT models,
including gpt2-small, gpt2-medium, gpt2-large and gpt2-xl.
The supported percentiles are 50, 90, 95, 99, and 99.9.
#### Metrics

| Name                                                                    | Unit                   | Description                                                               |
|-------------------------------------------------------------------------|------------------------|---------------------------------------------------------------------------|
| gpt_models/pytorch-${model_name}/fp32_train_step_time                   | time (ms)              | The average training step time with single precision.                     |
| gpt_models/pytorch-${model_name}/fp32_train_throughput                  | throughput (samples/s) | The average training throughput with single precision.                    |
| gpt_models/pytorch-${model_name}/fp32_inference_step_time_{percentile}  | time (ms)              | The {percentile}th percentile inference step time with single precision.  |
| gpt_models/pytorch-${model_name}/fp32_inference_throughput_{percentile} | throughput (samples/s) | The {percentile}th percentile inference throughput with single precision. |
| gpt_models/pytorch-${model_name}/fp16_train_step_time                   | time (ms)              | The average training step time with half precision.                       |
| gpt_models/pytorch-${model_name}/fp16_train_throughput                  | throughput (samples/s) | The average training throughput with half precision.                      |
| gpt_models/pytorch-${model_name}/fp16_inference_step_time_{percentile}  | time (ms)              | The {percentile}th percentile inference step time with half precision.    |
| gpt_models/pytorch-${model_name}/fp16_inference_throughput_{percentile} | throughput (samples/s) | The {percentile}th percentile inference throughput with half precision.   |
### `bert_models`

#### Introduction

Run training or inference tasks with single or half precision for BERT models, including bert-base and bert-large.
The supported percentiles are 50, 90, 95, 99, and 99.9.
#### Metrics

| Name                                                                     | Unit                   | Description                                                               |
|--------------------------------------------------------------------------|------------------------|---------------------------------------------------------------------------|
| bert_models/pytorch-${model_name}/fp32_train_step_time                   | time (ms)              | The average training step time with single precision.                     |
| bert_models/pytorch-${model_name}/fp32_train_throughput                  | throughput (samples/s) | The average training throughput with single precision.                    |
| bert_models/pytorch-${model_name}/fp32_inference_step_time_{percentile}  | time (ms)              | The {percentile}th percentile inference step time with single precision.  |
| bert_models/pytorch-${model_name}/fp32_inference_throughput_{percentile} | throughput (samples/s) | The {percentile}th percentile inference throughput with single precision. |
| bert_models/pytorch-${model_name}/fp16_train_step_time                   | time (ms)              | The average training step time with half precision.                       |
| bert_models/pytorch-${model_name}/fp16_train_throughput                  | throughput (samples/s) | The average training throughput with half precision.                      |
| bert_models/pytorch-${model_name}/fp16_inference_step_time_{percentile}  | time (ms)              | The {percentile}th percentile inference step time with half precision.    |
| bert_models/pytorch-${model_name}/fp16_inference_throughput_{percentile} | throughput (samples/s) | The {percentile}th percentile inference throughput with half precision.   |
### `lstm_models`

#### Introduction

Run training or inference tasks with single or half precision for one bidirectional LSTM model.
The supported percentiles are 50, 90, 95, 99, and 99.9.
#### Metrics

| Name                                                            | Unit                   | Description                                                               |
|-----------------------------------------------------------------|------------------------|---------------------------------------------------------------------------|
| lstm_models/pytorch-lstm/fp32_train_step_time                   | time (ms)              | The average training step time with single precision.                     |
| lstm_models/pytorch-lstm/fp32_train_throughput                  | throughput (samples/s) | The average training throughput with single precision.                    |
| lstm_models/pytorch-lstm/fp32_inference_step_time_{percentile}  | time (ms)              | The {percentile}th percentile inference step time with single precision.  |
| lstm_models/pytorch-lstm/fp32_inference_throughput_{percentile} | throughput (samples/s) | The {percentile}th percentile inference throughput with single precision. |
| lstm_models/pytorch-lstm/fp16_train_step_time                   | time (ms)              | The average training step time with half precision.                       |
| lstm_models/pytorch-lstm/fp16_train_throughput                  | throughput (samples/s) | The average training throughput with half precision.                      |
| lstm_models/pytorch-lstm/fp16_inference_step_time_{percentile}  | time (ms)              | The {percentile}th percentile inference step time with half precision.    |
| lstm_models/pytorch-lstm/fp16_inference_throughput_{percentile} | throughput (samples/s) | The {percentile}th percentile inference throughput with half precision.   |
### `cnn_models`

@@ -80,16 +83,17 @@ Run training or inference tasks with single or half precision for CNN models lis
* shufflenet: shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0
* squeezenet: squeezenet1_0, squeezenet1_1
* others: alexnet, googlenet, inception_v3
The supported percentiles are 50, 90, 95, 99, and 99.9.
#### Metrics

| Name                                                                    | Unit                   | Description                                                               |
|-------------------------------------------------------------------------|------------------------|---------------------------------------------------------------------------|
| cnn_models/pytorch-${model_name}/fp32_train_step_time                   | time (ms)              | The average training step time with single precision.                     |
| cnn_models/pytorch-${model_name}/fp32_train_throughput                  | throughput (samples/s) | The average training throughput with single precision.                    |
| cnn_models/pytorch-${model_name}/fp32_inference_step_time_{percentile}  | time (ms)              | The {percentile}th percentile inference step time with single precision.  |
| cnn_models/pytorch-${model_name}/fp32_inference_throughput_{percentile} | throughput (samples/s) | The {percentile}th percentile inference throughput with single precision. |
| cnn_models/pytorch-${model_name}/fp16_train_step_time                   | time (ms)              | The average training step time with half precision.                       |
| cnn_models/pytorch-${model_name}/fp16_train_throughput                  | throughput (samples/s) | The average training throughput with half precision.                      |
| cnn_models/pytorch-${model_name}/fp16_inference_step_time_{percentile}  | time (ms)              | The {percentile}th percentile inference step time with half precision.    |
| cnn_models/pytorch-${model_name}/fp16_inference_throughput_{percentile} | throughput (samples/s) | The {percentile}th percentile inference throughput with half precision.   |
@@ -142,6 +142,7 @@ def run(self):
        'knack>=0.7.2',
        'matplotlib>=3.0.0',
        'natsort>=7.1.1',
        'numpy>=1.19.2',
        'openpyxl>=3.0.7',
        'omegaconf==2.0.6',
        'pandas>=1.1.5',

...
@@ -9,6 +9,8 @@
from operator import attrgetter
from abc import ABC, abstractmethod

import numpy as np

from superbench.common.utils import logger
from superbench.benchmarks import BenchmarkType, ReturnCode
from superbench.benchmarks.result import BenchmarkResult
@@ -246,6 +248,22 @@ def __check_raw_data(self):
        return True
    def _process_percentile_result(self, metric, result, reduce_type=None):
        """Function to process the percentile results.

        Args:
            metric (str): metric name which is the key.
            result (List[numbers.Number]): numerical result.
            reduce_type (ReduceType): The type of reduce function.
        """
        if len(result) > 0:
            percentile_list = ['50', '90', '95', '99', '99.9']
            for percentile in percentile_list:
                self._result.add_result(
                    '{}_{}'.format(metric, percentile),
                    np.percentile(result, float(percentile), interpolation='nearest'), reduce_type
                )
    def print_env_info(self):
        """Print environments or dependencies information."""
        # TODO: will implement it when adding real benchmarks in the future.

...
@@ -49,13 +49,14 @@ def _benchmark(self):
        """
        pass

    def _process_numeric_result(self, metric, result, reduce_type=None, cal_percentile=False):
        """Function to save the numerical results.

        Args:
            metric (str): metric name which is the key.
            result (List[numbers.Number]): numerical result.
            reduce_type (ReduceType): The type of reduce function.
            cal_percentile (bool): Whether to calculate the percentile results.

        Return:
            True if result list is not empty.
@@ -70,6 +71,8 @@ def _process_numeric_result(self, metric, result, reduce_type=None):
        self._result.add_raw_data(metric, result)
        self._result.add_result(metric, statistics.mean(result), reduce_type)
        if cal_percentile:
            self._process_percentile_result(metric, result, reduce_type)

        return True

...
@@ -156,7 +156,7 @@ def _benchmark(self):
            else:
                precision = self._args.precision.value
            metric = '{}_{}_time'.format(precision, model)
            if not self._process_numeric_result(metric, elapse_times, cal_percentile=True):
                return False

            logger.info(

...
@@ -377,17 +377,21 @@ def __process_model_result(self, model_action, precision, step_times):
        if precision.value in precision_metric.keys():
            precision = precision_metric[precision.value]
        metric = '{}_{}_step_time'.format(precision, model_action)
        reduce_type = ReduceType.MAX if model_action is ModelAction.TRAIN else None
        self._result.add_raw_data(metric, step_times)
        self._result.add_result(metric, statistics.mean(step_times), reduce_type=reduce_type)
        if model_action == ModelAction.INFERENCE:
            self._process_percentile_result(metric, step_times, reduce_type=reduce_type)

        # The unit of step time is millisecond, use it to calculate the throughput with the unit samples/sec.
        millisecond_per_second = 1000
        throughput = [millisecond_per_second / step_time * self._args.batch_size for step_time in step_times]
        metric = '{}_{}_throughput'.format(precision, model_action)
        reduce_type = ReduceType.MIN if model_action is ModelAction.TRAIN else None
        self._result.add_raw_data(metric, throughput)
        self._result.add_result(metric, statistics.mean(throughput), reduce_type=reduce_type)
        if model_action == ModelAction.INFERENCE:
            self._process_percentile_result(metric, throughput, reduce_type=reduce_type)

        return True

...
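Taken alone, the millisecond-to-throughput conversion in that hunk works out like this (a batch size of 32 is assumed here; it is consistent with the 4.0 ms steps and 8000 samples/s figures used in the inference tests):

```python
# Convert per-step latencies in milliseconds to throughput in samples/s.
millisecond_per_second = 1000
batch_size = 32  # assumed example value
step_times = [4.0, 4.0, 4.0]
throughput = [millisecond_per_second / step_time * batch_size for step_time in step_times]
print(throughput)  # [8000.0, 8000.0, 8000.0]
```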
@@ -96,6 +96,17 @@ def test_micro_benchmark_base():
    assert (benchmark.result['metric1'] == [3.5])
    assert (benchmark.raw_data['metric1'] == [[1, 2, 3, 4, 5, 6]])

    benchmark._result._BenchmarkResult__result = dict()
    benchmark._result._BenchmarkResult__raw_data = dict()
    benchmark._process_numeric_result('metric1', [1, 3, 4, 2, 6, 5], cal_percentile=True)
    assert (benchmark.result['metric1'] == [3.5])
    assert (benchmark.result['metric1_50'] == [3])
    assert (benchmark.result['metric1_90'] == [5])
    assert (benchmark.result['metric1_95'] == [6])
    assert (benchmark.result['metric1_99'] == [6])
    assert (benchmark.result['metric1_99.9'] == [6])
    assert (benchmark.raw_data['metric1'] == [[1, 3, 4, 2, 6, 5]])
def test_micro_benchmark_with_invoke_base():
    """Test MicroBenchmarkWithInvoke."""

...
@@ -252,9 +252,21 @@ def test_inference():
        '"start_time": null, "end_time": null, "raw_data": {'
        '"fp16_inference_step_time": [[4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]], '
        '"fp16_inference_throughput": [[8000.0, 8000.0, 8000.0, 8000.0, 8000.0, 8000.0, 8000.0, 8000.0]]}, '
        '"result": {"return_code": [0], "fp16_inference_step_time": [4.0], '
        '"fp16_inference_step_time_50": [4.0], "fp16_inference_step_time_90": [4.0], '
        '"fp16_inference_step_time_95": [4.0], "fp16_inference_step_time_99": [4.0], '
        '"fp16_inference_step_time_99.9": [4.0], '
        '"fp16_inference_throughput": [8000.0], '
        '"fp16_inference_throughput_50": [8000.0], "fp16_inference_throughput_90": [8000.0], '
        '"fp16_inference_throughput_95": [8000.0], "fp16_inference_throughput_99": [8000.0], '
        '"fp16_inference_throughput_99.9": [8000.0]}, '
        '"reduce_op": {"return_code": null, "fp16_inference_step_time": null, '
        '"fp16_inference_step_time_50": null, "fp16_inference_step_time_90": null, '
        '"fp16_inference_step_time_95": null, "fp16_inference_step_time_99": null, '
        '"fp16_inference_step_time_99.9": null, "fp16_inference_throughput": null, '
        '"fp16_inference_throughput_50": null, "fp16_inference_throughput_90": null, '
        '"fp16_inference_throughput_95": null, "fp16_inference_throughput_99": null, '
        '"fp16_inference_throughput_99.9": null}}'
    )
    assert (benchmark._preprocess())
    assert (benchmark._ModelBenchmark__inference(Precision.FLOAT16))

...