unify name speed up and speedup to speedup (#4689)

Unverified Commit e8b88a79 authored Mar 28, 2022 by J-shang Committed by GitHub Mar 28, 2022
20 changed files
--- a/docs/source/tutorials/pruning_quick_start_mnist_codeobj.pickle
+++ b/docs/source/tutorials/pruning_quick_start_mnist_codeobj.pickle
--- a/docs/source/tutorials/pruning_speed_up.py.md5
+++ b/docs/source/tutorials/pruning_speed_up.py.md5
-e7a923e9f98f16e2eb4f3c29c2940f49
\ No newline at end of file
--- a/docs/source/tutorials/pruning_speed_up_codeobj.pickle
+++ b/docs/source/tutorials/pruning_speed_up_codeobj.pickle
--- a/docs/source/tutorials/pruning_speed_up.ipynb
+++ b/docs/source/tutorials/pruning_speed_up.ipynb
@@ -15,7 +15,7 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "\n# Speed Up Model with Mask\n\n## Introduction\n\nPruning algorithms usually use weight masks to simulate the real pruning. Masks can be used\nto check model performance of a specific pruning (or sparsity), but there is no real speedup.\nSince model speedup is the ultimate goal of model pruning, we try to provide a tool to users\nto convert a model to a smaller one based on user provided masks (the masks come from the\npruning algorithms).\n\nThere are two types of pruning. One is fine-grained pruning, it does not change the shape of weights,\nand input/output tensors. Sparse kernel is required to speed up a fine-grained pruned layer.\nThe other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning.\nTo speed up this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.\nSince the support of sparse kernels in community is limited,\nwe only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.\n\n## Design and Implementation\n\nTo speed up a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,\nor replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors,\nthus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change.\nTherefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;\nsecond, replace the modules.\n\nThe first step requires topology (i.e., connections) of the model, we use ``jit.trace`` to obtain the model graph for PyTorch.\nThe new shape of module is auto-inference by NNI, the unchanged parts of outputs during forward and inputs during backward are prepared for reduct.\nFor each type of module, we should prepare a function for module replacement.\nThe module replacement function returns a newly created module which is smaller.\n\n## Usage\n"
+        "\n# Speedup Model with Mask\n\n## Introduction\n\nPruning algorithms usually use weight masks to simulate the real pruning. Masks can be used\nto check model performance of a specific pruning (or sparsity), but there is no real speedup.\nSince model speedup is the ultimate goal of model pruning, we try to provide a tool to users\nto convert a model to a smaller one based on user provided masks (the masks come from the\npruning algorithms).\n\nThere are two types of pruning. One is fine-grained pruning, it does not change the shape of weights,\nand input/output tensors. Sparse kernel is required to speedup a fine-grained pruned layer.\nThe other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning.\nTo speedup this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.\nSince the support of sparse kernels in community is limited,\nwe only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.\n\n## Design and Implementation\n\nTo speedup a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,\nor replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors,\nthus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change.\nTherefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;\nsecond, replace the modules.\n\nThe first step requires topology (i.e., connections) of the model, we use ``jit.trace`` to obtain the model graph for PyTorch.\nThe new shape of module is auto-inference by NNI, the unchanged parts of outputs during forward and inputs during backward are prepared for reduct.\nFor each type of module, we should prepare a function for module replacement.\nThe module replacement function returns a newly created module which is smaller.\n\n## Usage\n"
      ]
    },
    {
@@ -76,7 +76,7 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "Speed up the model and show the model structure after speed up.\n\n"
+        "Speedup the model and show the model structure after speedup.\n\n"
      ]
    },
    {
@@ -94,7 +94,7 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "Roughly test the model after speed-up inference speed.\n\n"
+        "Roughly test the model after speedup inference speed.\n\n"
      ]
    },
    {
@@ -112,7 +112,7 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "For combining usage of ``Pruner`` masks generation with ``ModelSpeedup``,\nplease refer to `Pruning Quick Start <./pruning_quick_start_mnist.html>`__.\n\nNOTE: The current implementation supports PyTorch 1.3.1 or newer.\n\n## Limitations\n\nFor PyTorch we can only replace modules, if functions in ``forward`` should be replaced,\nour current implementation does not work. One workaround is make the function a PyTorch module.\n\nIf you want to speed up your own model which cannot supported by the current implementation,\nyou need implement the replace function for module replacement, welcome to contribute.\n\n## Speedup Results of Examples\n\nThe code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.\n\nThese result are tested on the `legacy pruning framework <../comporession/pruning_legacy>`__, new results will coming soon.\n\n### slim pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - Times\n     - Mask Latency\n     - Speedup Latency\n   * - 1\n     - 0.01197\n     - 0.005107\n   * - 2\n     - 0.02019\n     - 0.008769\n   * - 4\n     - 0.02733\n     - 0.014809\n   * - 8\n     - 0.04310\n     - 0.027441\n   * - 16\n     - 0.07731\n     - 0.05008\n   * - 32\n     - 0.14464\n     - 0.10027\n\n### fpgm pruner example\n\non cpu,\ninput tensor: ``torch.randn(64, 1, 28, 28)``\\ ,\ntoo large variance\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - Times\n     - Mask Latency\n     - Speedup Latency\n   * - 1\n     - 0.01383\n     - 0.01839\n   * - 2\n     - 0.01167\n     - 0.003558\n   * - 4\n     - 0.01636\n     - 0.01088\n   * - 40\n     - 0.14412\n     - 0.08268\n   * - 40\n     - 1.29385\n     - 0.14408\n   * - 40\n     - 0.41035\n     - 0.46162\n   * - 400\n     - 6.29020\n     - 5.82143\n\n### l1filter pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - Times\n     - Mask Latency\n     - Speedup Latency\n   * - 1\n     - 0.01026\n     - 0.003677\n   * - 2\n     - 0.01657\n     - 0.008161\n   * - 4\n     - 0.02458\n     - 0.020018\n   * - 8\n     - 0.03498\n     - 0.025504\n   * - 16\n     - 0.06757\n     - 0.047523\n   * - 32\n     - 0.10487\n     - 0.086442\n\n### APoZ pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - Times\n     - Mask Latency\n     - Speedup Latency\n   * - 1\n     - 0.01389\n     - 0.004208\n   * - 2\n     - 0.01628\n     - 0.008310\n   * - 4\n     - 0.02521\n     - 0.014008\n   * - 8\n     - 0.03386\n     - 0.023923\n   * - 16\n     - 0.06042\n     - 0.046183\n   * - 32\n     - 0.12421\n     - 0.087113\n\n### SimulatedAnnealing pruner example\n\nIn this experiment, we use SimulatedAnnealing pruner to prune the resnet18 on the cifar10 dataset.\nWe measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.\nThe latency is measured on one V100 GPU and the input tensor is  ``torch.randn(128, 3, 32, 32)``.\n\n<img src=\"file://../../img/SA_latency_accuracy.png\">\n\n"
+        "For combining usage of ``Pruner`` masks generation with ``ModelSpeedup``,\nplease refer to `Pruning Quick Start <./pruning_quick_start_mnist.html>`__.\n\nNOTE: The current implementation supports PyTorch 1.3.1 or newer.\n\n## Limitations\n\nFor PyTorch we can only replace modules, if functions in ``forward`` should be replaced,\nour current implementation does not work. One workaround is make the function a PyTorch module.\n\nIf you want to speedup your own model which cannot supported by the current implementation,\nyou need implement the replace function for module replacement, welcome to contribute.\n\n## Speedup Results of Examples\n\nThe code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.\n\nThese result are tested on the `legacy pruning framework <../comporession/pruning_legacy>`__, new results will coming soon.\n\n### slim pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - Times\n     - Mask Latency\n     - Speedup Latency\n   * - 1\n     - 0.01197\n     - 0.005107\n   * - 2\n     - 0.02019\n     - 0.008769\n   * - 4\n     - 0.02733\n     - 0.014809\n   * - 8\n     - 0.04310\n     - 0.027441\n   * - 16\n     - 0.07731\n     - 0.05008\n   * - 32\n     - 0.14464\n     - 0.10027\n\n### fpgm pruner example\n\non cpu,\ninput tensor: ``torch.randn(64, 1, 28, 28)``\\ ,\ntoo large variance\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - Times\n     - Mask Latency\n     - Speedup Latency\n   * - 1\n     - 0.01383\n     - 0.01839\n   * - 2\n     - 0.01167\n     - 0.003558\n   * - 4\n     - 0.01636\n     - 0.01088\n   * - 40\n     - 0.14412\n     - 0.08268\n   * - 40\n     - 1.29385\n     - 0.14408\n   * - 40\n     - 0.41035\n     - 0.46162\n   * - 400\n     - 6.29020\n     - 5.82143\n\n### l1filter pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - Times\n     - Mask Latency\n     - Speedup Latency\n   * - 1\n     - 0.01026\n     - 0.003677\n   * - 2\n     - 0.01657\n     - 0.008161\n   * - 4\n     - 0.02458\n     - 0.020018\n   * - 8\n     - 0.03498\n     - 0.025504\n   * - 16\n     - 0.06757\n     - 0.047523\n   * - 32\n     - 0.10487\n     - 0.086442\n\n### APoZ pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n   :header-rows: 1\n   :widths: auto\n\n   * - Times\n     - Mask Latency\n     - Speedup Latency\n   * - 1\n     - 0.01389\n     - 0.004208\n   * - 2\n     - 0.01628\n     - 0.008310\n   * - 4\n     - 0.02521\n     - 0.014008\n   * - 8\n     - 0.03386\n     - 0.023923\n   * - 16\n     - 0.06042\n     - 0.046183\n   * - 32\n     - 0.12421\n     - 0.087113\n\n### SimulatedAnnealing pruner example\n\nIn this experiment, we use SimulatedAnnealing pruner to prune the resnet18 on the cifar10 dataset.\nWe measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.\nThe latency is measured on one V100 GPU and the input tensor is  ``torch.randn(128, 3, 32, 32)``.\n\n<img src=\"file://../../img/SA_latency_accuracy.png\">\n\n"
      ]
    }
  ],

--- a/examples/tutorials/pruning_speed_up.py
+++ b/examples/tutorials/pruning_speed_up.py
 """
-Speed Up Model with Mask
+Speedup Model with Mask
 ========================

 Introduction
@@ -12,16 +12,16 @@ to convert a model to a smaller one based on user provided masks (the masks come
 pruning algorithms).

 There are two types of pruning. One is fine-grained pruning, it does not change the shape of weights,
-and input/output tensors. Sparse kernel is required to speed up a fine-grained pruned layer.
+and input/output tensors. Sparse kernel is required to speedup a fine-grained pruned layer.
 The other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning.
-To speed up this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.
+To speedup this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.
 Since the support of sparse kernels in community is limited,
 we only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.

 Design and Implementation
 -------------------------

-To speed up a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,
+To speedup a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,
 or replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors,
 thus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change.
 Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
@@ -64,13 +64,13 @@ model(torch.rand(128, 1, 28, 28).to(device))
 print('Original Model - Elapsed Time : ', time.time() - start)

 # %%
-# Speed up the model and show the model structure after speed up.
+# Speedup the model and show the model structure after speedup.
 from nni.compression.pytorch import ModelSpeedup
 ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
 print(model)

 # %%
-# Roughly test the model after speed-up inference speed.
+# Roughly test the model after speedup inference speed.
 start = time.time()
 model(torch.rand(128, 1, 28, 28).to(device))
 print('Speedup Model - Elapsed Time : ', time.time() - start)
@@ -87,7 +87,7 @@ print('Speedup Model - Elapsed Time : ', time.time() - start)
 # For PyTorch we can only replace modules, if functions in ``forward`` should be replaced,
 # our current implementation does not work. One workaround is make the function a PyTorch module.
 #
-# If you want to speed up your own model which cannot supported by the current implementation,
+# If you want to speedup your own model which cannot supported by the current implementation,
 # you need implement the replace function for module replacement, welcome to contribute.
 #
 # Speedup Results of Examples
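
Note on the tutorial text in this file: the docstring says a coarse-grained mask forces shape inference, because layers that were not pruned themselves may still need replacing once an upstream layer shrinks. The sketch below illustrates only that idea on a plain sequential chain. `ConvSpec` and `infer_shapes` are hypothetical names made up for this note, not NNI APIs, and NNI's real implementation works on the traced model graph rather than a list.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConvSpec:
    """Hypothetical stand-in for a conv layer: only its channel counts."""
    name: str
    in_channels: int
    out_channels: int


def infer_shapes(layers, kept_out_channels):
    """Propagate pruned output channels through a sequential chain.

    layers: ConvSpec objects in forward order (layer i feeds layer i + 1).
    kept_out_channels: maps a layer name to the output-channel indices its
        mask keeps; layers without an entry are treated as unpruned.
    Returns the smaller replacement specs, one per input layer.
    """
    new_layers = []
    prev_out = None  # channel count produced by the previous layer, if any
    for layer in layers:
        # The input width follows whatever the previous layer now emits.
        in_ch = layer.in_channels if prev_out is None else prev_out
        kept = kept_out_channels.get(layer.name)
        out_ch = layer.out_channels if kept is None else len(set(kept))
        new_layers.append(replace(layer, in_channels=in_ch, out_channels=out_ch))
        prev_out = out_ch
    return new_layers


# Only conv1 carries a mask, yet conv2 must also be rebuilt with 4 inputs.
chain = [ConvSpec("conv1", 1, 20), ConvSpec("conv2", 20, 50)]
pruned = infer_shapes(chain, {"conv1": [0, 3, 5, 7]})
for spec in pruned:
    print(spec)
```

In this toy chain `conv2` has no mask of its own but still changes shape, which is the case the "shape inference" step in the docstring is meant to catch.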

--- a/docs/source/tutorials/pruning_speedup.py.md5
+++ b/docs/source/tutorials/pruning_speedup.py.md5
+b5fa19199a998cec748c5a3a1479374f
\ No newline at end of file
--- a/docs/source/tutorials/pruning_speed_up.rst
+++ b/docs/source/tutorials/pruning_speed_up.rst
@@ -2,7 +2,7 @@
 .. DO NOT EDIT.
 .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
 .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
-.. "tutorials/pruning_speed_up.py"
+.. "tutorials/pruning_speedup.py"
 .. LINE NUMBERS ARE GIVEN BELOW.

 .. only:: html
@@ -10,15 +10,15 @@
    .. note::
        :class: sphx-glr-download-link-note

-        Click :ref:`here <sphx_glr_download_tutorials_pruning_speed_up.py>`
+        Click :ref:`here <sphx_glr_download_tutorials_pruning_speedup.py>`
        to download the full example code

 .. rst-class:: sphx-glr-example-title

-.. _sphx_glr_tutorials_pruning_speed_up.py:
+.. _sphx_glr_tutorials_pruning_speedup.py:


-Speed Up Model with Mask
+Speedup Model with Mask
 ========================

 Introduction
@@ -31,16 +31,16 @@ to convert a model to a smaller one based on user provided masks (the masks come
 pruning algorithms).

 There are two types of pruning. One is fine-grained pruning, it does not change the shape of weights,
-and input/output tensors. Sparse kernel is required to speed up a fine-grained pruned layer.
+and input/output tensors. Sparse kernel is required to speedup a fine-grained pruned layer.
 The other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning.
-To speed up this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.
+To speedup this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.
 Since the support of sparse kernels in community is limited,
 we only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.

 Design and Implementation
 -------------------------

-To speed up a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,
+To speedup a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,
 or replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors,
 thus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change.
 Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
@@ -136,14 +136,14 @@ Roughly test the original model inference speed.

 .. code-block:: none

-    Original Model - Elapsed Time :  0.10696005821228027
+    Original Model - Elapsed Time :  0.13896703720092773




 .. GENERATED FROM PYTHON SOURCE LINES 67-68

-Speed up the model and show the model structure after speed up.
+Speedup the model and show the model structure after speedup.

 .. GENERATED FROM PYTHON SOURCE LINES 68-72

@@ -180,7 +180,7 @@ Speed up the model and show the model structure after speed up.

 .. GENERATED FROM PYTHON SOURCE LINES 73-74

-Roughly test the model after speed-up inference speed.
+Roughly test the model after speedup inference speed.

 .. GENERATED FROM PYTHON SOURCE LINES 74-78

@@ -200,7 +200,7 @@ Roughly test the model after speed-up inference speed.

 .. code-block:: none

-    Speedup Model - Elapsed Time :  0.002137899398803711
+    Speedup Model - Elapsed Time :  0.003123760223388672



@@ -218,7 +218,7 @@ Limitations
 For PyTorch we can only replace modules, if functions in ``forward`` should be replaced,
 our current implementation does not work. One workaround is make the function a PyTorch module.

-If you want to speed up your own model which cannot supported by the current implementation,
+If you want to speedup your own model which cannot supported by the current implementation,
 you need implement the replace function for module replacement, welcome to contribute.

 Speedup Results of Examples
@@ -372,10 +372,10 @@ The latency is measured on one V100 GPU and the input tensor is  ``torch.randn(1

 .. rst-class:: sphx-glr-timing

-   **Total running time of the script:** ( 0 minutes  9.859 seconds)
+   **Total running time of the script:** ( 0 minutes  12.486 seconds)


-.. _sphx_glr_download_tutorials_pruning_speed_up.py:
+.. _sphx_glr_download_tutorials_pruning_speedup.py:


 .. only :: html
@@ -387,13 +387,13 @@ The latency is measured on one V100 GPU and the input tensor is  ``torch.randn(1

  .. container:: sphx-glr-download sphx-glr-download-python

-     :download:`Download Python source code: pruning_speed_up.py <pruning_speed_up.py>`
+     :download:`Download Python source code: pruning_speedup.py <pruning_speedup.py>`



  .. container:: sphx-glr-download sphx-glr-download-jupyter

-     :download:`Download Jupyter notebook: pruning_speed_up.ipynb <pruning_speed_up.ipynb>`
+     :download:`Download Jupyter notebook: pruning_speedup.ipynb <pruning_speedup.ipynb>`


 .. only:: html
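
Note on the regenerated timings in this file: the "Elapsed Time" values changed between documentation builds even though the tutorial code did not, which is what one would expect from a single `time.time()` measurement. A slightly more stable way to take such a rough measurement is sketched below. `measure_latency` is a hypothetical helper written for this note, not part of NNI or the tutorial, and the workload is a stand-in for the model's forward pass.

```python
import statistics
import time


def measure_latency(fn, warmup=3, repeats=10):
    """Return the median wall-clock time of fn() in seconds.

    Warm-up calls are discarded so one-time costs (lazy initialisation,
    cache fills) do not dominate the result. For a GPU model, fn should
    also wait for the device to finish (for PyTorch, by calling
    torch.cuda.synchronize()), because CUDA kernels run asynchronously.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in the tutorial this would be the model forward pass.
latency = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {latency:.6f} s")
```

The median is used rather than the mean so that a single slow run does not skew the reported number.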

--- a/docs/source/tutorials/pruning_speedup_codeobj.pickle
+++ b/docs/source/tutorials/pruning_speedup_codeobj.pickle
--- a/docs/source/tutorials/quantization_speed_up.py.md5
+++ b/docs/source/tutorials/quantization_speed_up.py.md5
-07fe95336a0d7edb8924dc24b609b361
\ No newline at end of file
--- a/docs/source/tutorials/quantization_speed_up_codeobj.pickle
+++ b/docs/source/tutorials/quantization_speed_up_codeobj.pickle
--- a/docs/source/tutorials/quantization_speed_up.ipynb
+++ b/docs/source/tutorials/quantization_speed_up.ipynb
@@ -15,7 +15,7 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "\n# Speed Up Model with Calibration Config\n\n\n## Introduction\n\nDeep learning network has been computational intensive and memory intensive \nwhich increases the difficulty of deploying deep neural network model. Quantization is a \nfundamental technology which is widely used to reduce memory footprint and speed up inference \nprocess. Many frameworks begin to support quantization, but few of them support mixed precision \nquantization and get real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__\\, only support simulated mixed precision quantization which will \nnot speed up the inference process. To get real speedup of mixed precision quantization and \nhelp people get the real feedback from hardware, we design a general framework with simple interface to allow NNI quantization algorithms to connect different \nDL model optimization backends (e.g., TensorRT, NNFusion), which gives users an end-to-end experience that after quantizing their model \nwith quantization algorithms, the quantized model can be directly speeded up with the connected optimization backend. NNI connects \nTensorRT at this stage, and will support more backends in the future.\n\n\n## Design and Implementation\n\nTo support speeding up mixed precision quantization, we divide framework into two part, frontend and backend.  \nFrontend could be popular training frameworks such as PyTorch, TensorFlow etc. Backend could be inference \nframework for different hardwares, such as TensorRT. At present, we support PyTorch as frontend and \nTensorRT as backend. To convert PyTorch model to TensorRT engine, we leverage onnx as intermediate graph \nrepresentation. In this way, we convert PyTorch model to onnx model, then TensorRT parse onnx \nmodel to generate inference engine. \n\n\nQuantization aware training combines NNI quantization algorithm 'QAT' and NNI quantization speedup tool.\nUsers should set config to train quantized model using QAT algorithm(please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__\\  ).\nAfter quantization aware training, users can get new config with calibration parameters and model with quantized weight. By passing new config and model to quantization speedup tool, users can get real mixed precision speedup engine to do inference.\n\n\nAfter getting mixed precision engine, users can do inference with input data.\n\n\nNote\n\n\n* Recommend using \"cpu\"(host) as data device(for both inference data and calibration data) since data should be on host initially and it will be transposed to device before inference. If data type is not \"cpu\"(host), this tool will transpose it to \"cpu\" which may increases unnecessary overhead.\n* User can also do post-training quantization leveraging TensorRT directly(need to provide calibration dataset).\n* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in the following release.\n\n\n## Prerequisite\nCUDA version >= 11.0\n\nTensorRT version >= 7.2\n\nNote\n\n* If you haven't installed TensorRT before or use the old version, please refer to `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__\\  \n\n## Usage\n"
+        "\n# SpeedUp Model with Calibration Config\n\n\n## Introduction\n\nDeep learning network has been computational intensive and memory intensive \nwhich increases the difficulty of deploying deep neural network model. Quantization is a \nfundamental technology which is widely used to reduce memory footprint and speedup inference \nprocess. Many frameworks begin to support quantization, but few of them support mixed precision \nquantization and get real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__\\, only support simulated mixed precision quantization which will \nnot speedup the inference process. To get real speedup of mixed precision quantization and \nhelp people get the real feedback from hardware, we design a general framework with simple interface to allow NNI quantization algorithms to connect different \nDL model optimization backends (e.g., TensorRT, NNFusion), which gives users an end-to-end experience that after quantizing their model \nwith quantization algorithms, the quantized model can be directly speeded up with the connected optimization backend. NNI connects \nTensorRT at this stage, and will support more backends in the future.\n\n\n## Design and Implementation\n\nTo support speeding up mixed precision quantization, we divide framework into two part, frontend and backend.  \nFrontend could be popular training frameworks such as PyTorch, TensorFlow etc. Backend could be inference \nframework for different hardwares, such as TensorRT. At present, we support PyTorch as frontend and \nTensorRT as backend. To convert PyTorch model to TensorRT engine, we leverage onnx as intermediate graph \nrepresentation. In this way, we convert PyTorch model to onnx model, then TensorRT parse onnx \nmodel to generate inference engine. \n\n\nQuantization aware training combines NNI quantization algorithm 'QAT' and NNI quantization speedup tool.\nUsers should set config to train quantized model using QAT algorithm(please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__\\  ).\nAfter quantization aware training, users can get new config with calibration parameters and model with quantized weight. By passing new config and model to quantization speedup tool, users can get real mixed precision speedup engine to do inference.\n\n\nAfter getting mixed precision engine, users can do inference with input data.\n\n\nNote\n\n\n* Recommend using \"cpu\"(host) as data device(for both inference data and calibration data) since data should be on host initially and it will be transposed to device before inference. If data type is not \"cpu\"(host), this tool will transpose it to \"cpu\" which may increases unnecessary overhead.\n* User can also do post-training quantization leveraging TensorRT directly(need to provide calibration dataset).\n* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in the following release.\n\n\n## Prerequisite\nCUDA version >= 11.0\n\nTensorRT version >= 7.2\n\nNote\n\n* If you haven't installed TensorRT before or use the old version, please refer to `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__\\  \n\n## Usage\n"
      ]
    },
    {
@@ -69,7 +69,7 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "build tensorRT engine to make a real speed up\n\n"
+        "build tensorRT engine to make a real speedup\n\n"
      ]
    },
    {

--- a/docs/source/tutorials/quantization_speed_up.py
+++ b/docs/source/tutorials/quantization_speed_up.py
 """
-Speed Up Model with Calibration Config
+SpeedUp Model with Calibration Config
 ======================================


@@ -8,10 +8,10 @@ Introduction

 Deep learning network has been computational intensive and memory intensive 
 which increases the difficulty of deploying deep neural network model. Quantization is a 
-fundamental technology which is widely used to reduce memory footprint and speed up inference 
+fundamental technology which is widely used to reduce memory footprint and speedup inference 
 process. Many frameworks begin to support quantization, but few of them support mixed precision 
 quantization and get real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__\, only support simulated mixed precision quantization which will 
-not speed up the inference process. To get real speedup of mixed precision quantization and 
+not speedup the inference process. To get real speedup of mixed precision quantization and 
 help people get the real feedback from hardware, we design a general framework with simple interface to allow NNI quantization algorithms to connect different 
 DL model optimization backends (e.g., TensorRT, NNFusion), which gives users an end-to-end experience that after quantizing their model 
 with quantization algorithms, the quantized model can be directly speeded up with the connected optimization backend. NNI connects 
@@ -108,7 +108,7 @@ calibration_config = quantizer.export_model(model_path, calibration_path)
 print("calibration_config: ", calibration_config)

 # %%
-# build tensorRT engine to make a real speed up
+# build tensorRT engine to make a real speedup

 # from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
 # input_shape = (32, 1, 28, 28)
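
Note on `calibration_config` in this file: the exported config carries per-layer `weight_bits`, `weight_scale` and `weight_zero_point` entries. The sketch below shows how such a scale and zero point are commonly used in affine (asymmetric) quantization. It assumes an unsigned integer range of `[0, 2**bits - 1]`, and whether NNI's QAT quantizer or TensorRT applies exactly this convention is an assumption, not something this diff confirms. `quantize` and `dequantize` are hypothetical helpers, and the numbers are the `conv1` values logged in the generated `.rst`.

```python
def quantize(x, scale, zero_point, bits=8):
    """Map a float to an unsigned integer code (affine quantization).

    Assumes the code range is [0, 2**bits - 1]; values that fall outside
    the representable range are clamped to the nearest end.
    """
    qmin, qmax = 0, 2 ** bits - 1
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to its approximate float value."""
    return (q - zero_point) * scale


# Values as logged for conv1: weight_scale 0.0026, weight_zero_point 103.
scale, zero_point = 0.0026, 103
for weight in (0.0, 0.1, -0.2, 10.0):
    code = quantize(weight, scale, zero_point)
    print(weight, code, dequantize(code, scale, zero_point))
```

Inside the representable range the round-trip error is at most half a quantization step (`scale / 2`); the last value, `10.0`, lies outside the range and is clamped.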

--- a/docs/source/tutorials/quantization_speedup.py.md5
+++ b/docs/source/tutorials/quantization_speedup.py.md5
+4bdf41a0267e314eb516c84d845c9f7b
\ No newline at end of file
--- a/docs/source/tutorials/quantization_speed_up.rst
+++ b/docs/source/tutorials/quantization_speed_up.rst
@@ -2,7 +2,7 @@
 .. DO NOT EDIT.
 .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
 .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
-.. "tutorials/quantization_speed_up.py"
+.. "tutorials/quantization_speedup.py"
 .. LINE NUMBERS ARE GIVEN BELOW.

 .. only:: html
@@ -10,15 +10,15 @@
    .. note::
        :class: sphx-glr-download-link-note

-        Click :ref:`here <sphx_glr_download_tutorials_quantization_speed_up.py>`
+        Click :ref:`here <sphx_glr_download_tutorials_quantization_speedup.py>`
        to download the full example code

 .. rst-class:: sphx-glr-example-title

-.. _sphx_glr_tutorials_quantization_speed_up.py:
+.. _sphx_glr_tutorials_quantization_speedup.py:


-Speed Up Model with Calibration Config
+SpeedUp Model with Calibration Config
 ======================================


@@ -27,10 +27,10 @@ Introduction

 Deep learning network has been computational intensive and memory intensive 
 which increases the difficulty of deploying deep neural network model. Quantization is a 
-fundamental technology which is widely used to reduce memory footprint and speed up inference 
+fundamental technology which is widely used to reduce memory footprint and speedup inference 
 process. Many frameworks begin to support quantization, but few of them support mixed precision 
 quantization and get real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__\, only support simulated mixed precision quantization which will 
-not speed up the inference process. To get real speedup of mixed precision quantization and 
+not speedup the inference process. To get real speedup of mixed precision quantization and 
 help people get the real feedback from hardware, we design a general framework with simple interface to allow NNI quantization algorithms to connect different 
 DL model optimization backends (e.g., TensorRT, NNFusion), which gives users an end-to-end experience that after quantizing their model 
 with quantization algorithms, the quantized model can be directly speeded up with the connected optimization backend. NNI connects 
@@ -123,8 +123,8 @@ Usage

 .. code-block:: none

-    [2022-02-21 18:53:07] WARNING (nni.algorithms.compression.pytorch.quantization.qat_quantizer/MainThread) op_names ['relu1'] not found in model
-    [2022-02-21 18:53:07] WARNING (nni.algorithms.compression.pytorch.quantization.qat_quantizer/MainThread) op_names ['relu2'] not found in model
+    op_names ['relu1'] not found in model
+    op_names ['relu2'] not found in model

    TorchModel(
      (conv1): QuantizerModuleWrapper(
@@ -162,9 +162,9 @@ finetuning the model by using QAT

 .. code-block:: none

-    Average test loss: 0.2524, Accuracy: 9209/10000 (92%)
-    Average test loss: 0.1711, Accuracy: 9461/10000 (95%)
-    Average test loss: 0.1037, Accuracy: 9690/10000 (97%)
+    Average test loss: 0.3100, Accuracy: 9056/10000 (91%)
+    Average test loss: 0.1559, Accuracy: 9558/10000 (96%)
+    Average test loss: 0.1031, Accuracy: 9690/10000 (97%)



@@ -193,16 +193,14 @@ export model and get calibration_config

 .. code-block:: none

-    [2022-02-21 18:53:54] INFO (nni.compression.pytorch.compressor/MainThread) Model state_dict saved to ./log/mnist_model.pth
-    [2022-02-21 18:53:54] INFO (nni.compression.pytorch.compressor/MainThread) Mask dict saved to ./log/mnist_calibration.pth
-    calibration_config:  {'conv1': {'weight_bits': 8, 'weight_scale': tensor([0.0026], device='cuda:0'), 'weight_zero_point': tensor([103.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': -0.4242129623889923, 'tracked_max_input': 2.821486711502075}, 'conv2': {'weight_bits': 8, 'weight_scale': tensor([0.0019], device='cuda:0'), 'weight_zero_point': tensor([116.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 10.175512313842773}}
+    calibration_config:  {'conv1': {'weight_bits': 8, 'weight_scale': tensor([0.0031], device='cuda:0'), 'weight_zero_point': tensor([103.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': -0.4242129623889923, 'tracked_max_input': 2.821486711502075}, 'conv2': {'weight_bits': 8, 'weight_scale': tensor([0.0018], device='cuda:0'), 'weight_zero_point': tensor([111.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 10.046737670898438}}




 .. GENERATED FROM PYTHON SOURCE LINES 111-112

-build tensorRT engine to make a real speed up
+build tensorRT engine to make a real speedup

 .. GENERATED FROM PYTHON SOURCE LINES 112-119

@@ -279,10 +277,10 @@ input tensor: ``torch.randn(128, 3, 32, 32)``

 .. rst-class:: sphx-glr-timing

-   **Total running time of the script:** ( 0 minutes  52.798 seconds)
+   **Total running time of the script:** ( 0 minutes  55.231 seconds)


-.. _sphx_glr_download_tutorials_quantization_speed_up.py:
+.. _sphx_glr_download_tutorials_quantization_speedup.py:


 .. only :: html
@@ -294,13 +292,13 @@ input tensor: ``torch.randn(128, 3, 32, 32)``

  .. container:: sphx-glr-download sphx-glr-download-python

-     :download:`Download Python source code: quantization_speed_up.py <quantization_speed_up.py>`
+     :download:`Download Python source code: quantization_speedup.py <quantization_speedup.py>`



  .. container:: sphx-glr-download sphx-glr-download-jupyter

-     :download:`Download Jupyter notebook: quantization_speed_up.ipynb <quantization_speed_up.ipynb>`
+     :download:`Download Jupyter notebook: quantization_speedup.ipynb <quantization_speedup.ipynb>`


 .. only:: html

--- a/docs/source/tutorials/quantization_speedup_codeobj.pickle
+++ b/docs/source/tutorials/quantization_speedup_codeobj.pickle
--- a/docs/source/tutorials/sg_execution_times.rst
+++ b/docs/source/tutorials/sg_execution_times.rst
@@ -5,12 +5,14 @@

 Computation times
 =================
-**03:24.740** total execution time for **tutorials** files:
+**02:34.670** total execution time for **tutorials** files:

 +-----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py` (``quantization_quick_start_mnist.py``) | 01:51.644 | 0.0 MB |
+| :ref:`sphx_glr_tutorials_pruning_quick_start_mnist.py` (``pruning_quick_start_mnist.py``)           | 01:26.953 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorials_pruning_quick_start_mnist.py` (``pruning_quick_start_mnist.py``)           | 01:33.096 | 0.0 MB |
+| :ref:`sphx_glr_tutorials_quantization_speedup.py` (``quantization_speedup.py``)                     | 00:55.231 | 0.0 MB |
++-----------------------------------------------------------------------------------------------------+-----------+--------+
+| :ref:`sphx_glr_tutorials_pruning_speedup.py` (``pruning_speedup.py``)                               | 00:12.486 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_tutorials_hello_nas.py` (``hello_nas.py``)                                           | 00:00.000 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------------+-----------+--------+
@@ -20,9 +22,7 @@ Computation times
 +-----------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_tutorials_pruning_customize.py` (``pruning_customize.py``)                           | 00:00.000 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorials_pruning_speed_up.py` (``pruning_speed_up.py``)                             | 00:00.000 | 0.0 MB |
-+-----------------------------------------------------------------------------------------------------+-----------+--------+
 | :ref:`sphx_glr_tutorials_quantization_customize.py` (``quantization_customize.py``)                 | 00:00.000 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------------+-----------+--------+
-| :ref:`sphx_glr_tutorials_quantization_speed_up.py` (``quantization_speed_up.py``)                   | 00:00.000 | 0.0 MB |
+| :ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py` (``quantization_quick_start_mnist.py``) | 00:00.000 | 0.0 MB |
 +-----------------------------------------------------------------------------------------------------+-----------+--------+
--- a/examples/model_compress/end2end_compression.py
+++ b/examples/model_compress/end2end_compression.py
@@ -290,8 +290,8 @@ if __name__ == '__main__':
    #                     help='learning rate to finetune the model')

    # speedup
-    # parser.add_argument('--speed-up', action='store_true', default=False,
-    #                     help='whether to speed-up the pruned model')
+    # parser.add_argument('--speedup', action='store_true', default=False,
+    #                     help='whether to speedup the pruned model')

    # parser.add_argument('--nni', action='store_true', default=False,
    #                     help="whether to tune the pruners using NNi tuners")

--- a/examples/model_compress/pruning/auto_pruners_torch.py
+++ b/examples/model_compress/pruning/auto_pruners_torch.py
@@ -292,8 +292,8 @@ def main(args):
            os.path.join(args.experiment_data_dir, 'model_masked.pth'), os.path.join(args.experiment_data_dir, 'mask.pth'))
        print('Masked model saved to %s' % args.experiment_data_dir)

-    # model speed up
-    if args.speed_up:
+    # model speedup
+    if args.speedup:
        if args.pruner != 'AutoCompressPruner':
            if args.model == 'LeNet':
                model = LeNet().to(device)
@@ -310,11 +310,11 @@ def main(args):
            m_speedup = ModelSpeedup(model, dummy_input, masks_file, device)
            m_speedup.speedup_model()
            evaluation_result = evaluator(model)
-            print('Evaluation result (speed up model): %s' % evaluation_result)
+            print('Evaluation result (speedup model): %s' % evaluation_result)
            result['performance']['speedup'] = evaluation_result

-            torch.save(model.state_dict(), os.path.join(args.experiment_data_dir, 'model_speed_up.pth'))
-            print('Speed up model saved to %s' % args.experiment_data_dir)
+            torch.save(model.state_dict(), os.path.join(args.experiment_data_dir, 'model_speedup.pth'))
+            print('Speedup model saved to %s' % args.experiment_data_dir)
        flops, params, _ = count_flops_params(model, get_input_size(args.dataset))
        result['flops']['speedup'] = flops
        result['params']['speedup'] = params
@@ -402,9 +402,9 @@ if __name__ == '__main__':
    parser.add_argument('--sparsity-per-iteration', type=float, default=0.05,
                        help='sparsity_per_iteration of NetAdaptPruner')

-    # speed-up
-    parser.add_argument('--speed-up', type=str2bool, default=False,
-                        help='Whether to speed-up the pruned model')
+    # speedup
+    parser.add_argument('--speedup', type=str2bool, default=False,
+                        help='Whether to speedup the pruned model')

    # others
    parser.add_argument('--log-interval', type=int, default=200,
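
Note on the rename above: argparse derives the attribute name from the long option string, so renaming `--speed-up` to `--speedup` is what moves the attribute from `args.speed_up` to `args.speedup`. The sketch below illustrates that behaviour in isolation; the `str2bool` helper here is a hypothetical stand-in for the one the example script defines, and `build_parser` is an illustrative name, not part of the script.

```python
import argparse


def str2bool(value):
    # Hypothetical stand-in for the script's own str2bool helper.
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ('yes', 'true', 't', 'y', '1'):
        return True
    if lowered in ('no', 'false', 'f', 'n', '0'):
        return False
    raise argparse.ArgumentTypeError('Boolean value expected.')


def build_parser():
    # Illustrative parser containing only the renamed option.
    parser = argparse.ArgumentParser()
    parser.add_argument('--speedup', type=str2bool, default=False,
                        help='Whether to speedup the pruned model')
    return parser


# New spelling: the value is stored on args.speedup.
args = build_parser().parse_args(['--speedup', 'true'])

# Old spelling: no longer a known option, so parse_known_args
# hands it back as leftover tokens and the default stays in place.
legacy, unknown = build_parser().parse_known_args(['--speed-up', 'true'])

print(args.speedup, legacy.speedup, unknown)
```

Callers that still pass `--speed-up` would therefore need updating; with plain `parse_args` the old spelling is reported as an unrecognized argument.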

--- a/examples/model_compress/pruning/basic_pruners_torch.py
+++ b/examples/model_compress/pruning/basic_pruners_torch.py
@@ -4,7 +4,7 @@
 '''
 NNI example for supported basic pruning algorithms.
 In this example, we show the end-to-end pruning process: pre-training -> pruning -> fine-tuning.
-Note that pruners use masks to simulate the real pruning. In order to obtain a real compressed model, model speed up is required.
+Note that pruners use masks to simulate the real pruning. In order to obtain a real compressed model, model speedup is required.
 You can also try auto_pruners_torch.py to see the usage of some automatic pruning algorithms.

 '''
@@ -292,7 +292,7 @@ def main(args):
    if args.test_only:
        test(args, model, device, criterion, test_loader)

-    if args.speed_up:
+    if args.speedup:
        # Unwrap all modules to normal state
        pruner._unwrap_model()
        m_speedup = ModelSpeedup(model, dummy_input, mask_path, device)
@@ -364,9 +364,9 @@ if __name__ == '__main__':
                                 'fpgm', 'mean_activation', 'apoz', 'taylorfo'],
                        help='pruner to use')

-    # speed-up
-    parser.add_argument('--speed-up', action='store_true', default=False,
-                        help='Whether to speed-up the pruned model')
+    # speedup
+    parser.add_argument('--speedup', action='store_true', default=False,
+                        help='Whether to speedup the pruned model')

    # fine-tuning
    parser.add_argument('--fine-tune-epochs', type=int, default=160,

--- a/examples/model_compress/pruning/mobilenetv2_end2end/Compressing MobileNetV2 with NNI Pruners.ipynb
+++ b/examples/model_compress/pruning/mobilenetv2_end2end/Compressing MobileNetV2 with NNI Pruners.ipynb