Unverified Commit c299e576 authored by J-shang, committed by GitHub

[Doc] add an example for compression config list & add videos in doc (#4890)

parent 64ae3d73
......@@ -18,4 +18,5 @@ sphinx-gallery
sphinx-intl
sphinx-tabs
sphinxcontrib-bibtex
sphinxcontrib-youtube
git+https://github.com/bashtage/sphinx-material@6e0ef82#egg=sphinx_material
......@@ -93,3 +93,59 @@ quant_start_step
Specific key for ``QAT Quantizer``. Disable quantization until the model has been run for a certain number of steps;
this allows the network to enter a more stable state, where output quantization ranges do not exclude a significant fraction of values.
Default value is 0.
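For example, the following config list entry (the bit widths and step count here are illustrative values, not recommendations) delays quantization of ``Conv2d`` modules until 1000 training steps have passed::

    config_list = [{
        'op_types': ['Conv2d'],
        'quant_types': ['input', 'weight'],
        'quant_bits': {'input': 8, 'weight': 8},
        'quant_start_step': 1000
    }]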
Examples
--------
Suppose we want to compress the following model::
    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout2d(0.25)
            self.dropout2 = nn.Dropout2d(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            ...
First, we need to determine where to compress. Use the following config list to specify all ``Conv2d`` modules and the module named ``fc1`` as compression targets::
    config_list = [{'op_types': ['Conv2d']}, {'op_names': ['fc1']}]
Sometimes we may need to compress all modules of a certain type except for a few special ones.
Writing out every module name would be laborious in that case; instead, we can use ``exclude`` to quickly specify the compression target modules::
    config_list = [{
        'op_types': ['Conv2d', 'Linear']
    }, {
        'exclude': True,
        'op_names': ['fc2']
    }]
For the model we want to compress, the above two config lists are equivalent: both select ``conv1``, ``conv2``, and ``fc1`` as compression targets.
Let's take a simple pruning config list as an example: prune all ``Conv2d`` modules with 50% sparsity, and prune ``fc1`` with 80% sparsity::
    config_list = [{
        'op_types': ['Conv2d'],
        'total_sparsity': 0.5
    }, {
        'op_names': ['fc1'],
        'total_sparsity': 0.8
    }]
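A config list like this is then passed to a pruner together with the model. The following is a minimal sketch based on the pruning quickstart tutorial; the import path and two-argument constructor follow that tutorial::

    from nni.compression.pytorch.pruning import L1NormPruner

    # wrap the model and generate masks for the layers selected by config_list
    pruner = L1NormPruner(model, config_list)
    _, masks = pruner.compress()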
Then, if you want to try model quantization, here is a simple config list example::
    config_list = [{
        'op_types': ['Conv2d'],
        'quant_types': ['input', 'weight'],
        'quant_bits': {'input': 8, 'weight': 8}
    }, {
        'op_names': ['fc1'],
        'quant_types': ['input', 'weight'],
        'quant_bits': {'input': 8, 'weight': 8}
    }]
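As with pruning, the config list is handed to a quantizer together with the model. A minimal sketch modeled on the quantization quickstart tutorial; ``optimizer`` and ``dummy_input`` are assumed to be defined elsewhere::

    from nni.compression.pytorch.quantization import QAT_Quantizer

    # optimizer and dummy_input are assumed to be created beforehand,
    # as in the quantization quickstart; compress() wraps the selected
    # layers so they simulate quantization during training
    quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
    quantizer.compress()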
......@@ -53,6 +53,7 @@ extensions = [
'sphinx.ext.viewcode',
'sphinx.ext.intersphinx',
'sphinxcontrib.bibtex',
'sphinxcontrib.youtube',
# 'nbsphinx', # nbsphinx has conflicts with sphinx-gallery.
'sphinx.ext.extlinks',
'IPython.sphinxext.ipython_console_highlighting',
......
.. 16ce3c41e8ec5389a2071e6cbe56ccab
.. a6a9f0292afa81c7796304ae7da5afcd
Web UI
======
......
......@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: NNI \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2022-04-20 05:50+0000\n"
"POT-Creation-Date: 2022-05-27 16:52+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
......@@ -127,12 +127,12 @@ msgstr ""
#: ../../source/tutorials/hello_nas.rst:564
#: ../../source/tutorials/hpo_quickstart_pytorch/main.rst:244
#: ../../source/tutorials/hpo_quickstart_pytorch/main.rst:281
#: ../../source/tutorials/pruning_quick_start_mnist.rst:65
#: ../../source/tutorials/pruning_quick_start_mnist.rst:107
#: ../../source/tutorials/pruning_quick_start_mnist.rst:172
#: ../../source/tutorials/pruning_quick_start_mnist.rst:218
#: ../../source/tutorials/pruning_quick_start_mnist.rst:255
#: ../../source/tutorials/pruning_quick_start_mnist.rst:283
#: ../../source/tutorials/pruning_quick_start_mnist.rst:70
#: ../../source/tutorials/pruning_quick_start_mnist.rst:112
#: ../../source/tutorials/pruning_quick_start_mnist.rst:177
#: ../../source/tutorials/pruning_quick_start_mnist.rst:223
#: ../../source/tutorials/pruning_quick_start_mnist.rst:260
#: ../../source/tutorials/pruning_quick_start_mnist.rst:288
msgid "Out:"
msgstr ""
......@@ -347,7 +347,7 @@ msgstr ""
#: ../../source/tutorials/hello_nas.rst:625
#: ../../source/tutorials/hpo_quickstart_pytorch/main.rst:335
#: ../../source/tutorials/pruning_quick_start_mnist.rst:357
#: ../../source/tutorials/pruning_quick_start_mnist.rst:362
msgid "`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_"
msgstr ""
......@@ -648,49 +648,53 @@ msgid "Pruning Quickstart"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:24
msgid "Here is a three-minute video to get you started with model pruning."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:29
msgid ""
"Model pruning is a technique to reduce the model size and computation by "
"reducing model weight size or intermediate state size. There are three "
"common practices for pruning a DNN model:"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:27
#: ../../source/tutorials/pruning_quick_start_mnist.rst:32
msgid "Pre-training a model -> Pruning the model -> Fine-tuning the pruned model"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:28
#: ../../source/tutorials/pruning_quick_start_mnist.rst:33
msgid ""
"Pruning a model during training (i.e., pruning aware training) -> Fine-"
"tuning the pruned model"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:29
#: ../../source/tutorials/pruning_quick_start_mnist.rst:34
msgid "Pruning a model -> Training the pruned model from scratch"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:31
#: ../../source/tutorials/pruning_quick_start_mnist.rst:36
msgid ""
"NNI supports all of the above pruning practices by working on the key "
"pruning stage. Following this tutorial for a quick look at how to use NNI"
" to prune a model in a common practice."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:37
#: ../../source/tutorials/pruning_quick_start_mnist.rst:42
msgid "Preparation"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:39
#: ../../source/tutorials/pruning_quick_start_mnist.rst:44
msgid ""
"In this tutorial, we use a simple model and pre-trained on MNIST dataset."
" If you are familiar with defining a model and training in pytorch, you "
"can skip directly to `Pruning Model`_."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:121
#: ../../source/tutorials/pruning_quick_start_mnist.rst:126
msgid "Pruning Model"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:123
#: ../../source/tutorials/pruning_quick_start_mnist.rst:128
msgid ""
"Using L1NormPruner to prune the model and generate the masks. Usually, a "
"pruner requires original model and ``config_list`` as its inputs. "
......@@ -699,7 +703,7 @@ msgid ""
"<../compression/compression_config_list>`."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:127
#: ../../source/tutorials/pruning_quick_start_mnist.rst:132
msgid ""
"The following `config_list` means all layers whose type is `Linear` or "
"`Conv2d` will be pruned, except the layer named `fc3`, because `fc3` is "
......@@ -707,11 +711,11 @@ msgid ""
"named `fc3` will not be pruned."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:153
#: ../../source/tutorials/pruning_quick_start_mnist.rst:158
msgid "Pruners usually require `model` and `config_list` as input arguments."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:232
#: ../../source/tutorials/pruning_quick_start_mnist.rst:237
msgid ""
"Speedup the original model with masks, note that `ModelSpeedup` requires "
"an unwrapped model. The model becomes smaller after speedup, and reaches "
......@@ -719,32 +723,32 @@ msgid ""
"across layers."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:269
#: ../../source/tutorials/pruning_quick_start_mnist.rst:274
msgid "the model will become real smaller after speedup"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:307
#: ../../source/tutorials/pruning_quick_start_mnist.rst:312
msgid "Fine-tuning Compacted Model"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:308
#: ../../source/tutorials/pruning_quick_start_mnist.rst:313
msgid ""
"Note that if the model has been sped up, you need to re-initialize a new "
"optimizer for fine-tuning. Because speedup will replace the masked big "
"layers with dense small ones."
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:329
msgid "**Total running time of the script:** ( 0 minutes 58.337 seconds)"
#: ../../source/tutorials/pruning_quick_start_mnist.rst:334
msgid "**Total running time of the script:** ( 1 minutes 30.730 seconds)"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:344
#: ../../source/tutorials/pruning_quick_start_mnist.rst:349
msgid ""
":download:`Download Python source code: pruning_quick_start_mnist.py "
"<pruning_quick_start_mnist.py>`"
msgstr ""
#: ../../source/tutorials/pruning_quick_start_mnist.rst:350
#: ../../source/tutorials/pruning_quick_start_mnist.rst:355
msgid ""
":download:`Download Jupyter notebook: pruning_quick_start_mnist.ipynb "
"<pruning_quick_start_mnist.ipynb>`"
......@@ -771,3 +775,6 @@ msgstr ""
#~ msgid "**Total running time of the script:** ( 1 minutes 24.393 seconds)"
#~ msgstr ""
#~ msgid "**Total running time of the script:** ( 0 minutes 58.337 seconds)"
#~ msgstr ""
......@@ -53,7 +53,7 @@ Tutorials
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Quantization reduces model size and speeds up inference time by reducing the number of bits req...">
<div class="sphx-glr-thumbcontainer" tooltip="Here is a four-minute video to get you started with model quantization.">
.. only:: html
......@@ -74,7 +74,7 @@ Tutorials
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Model pruning is a technique to reduce the model size and computation by reducing model weight ...">
<div class="sphx-glr-thumbcontainer" tooltip="Here is a three-minute video to get you started with model pruning.">
.. only:: html
......
......@@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Pruning Quickstart\n\nModel pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.\nThere are three common practices for pruning a DNN model:\n\n#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model\n#. Pruning a model during training (i.e., pruning aware training) -> Fine-tuning the pruned model\n#. Pruning a model -> Training the pruned model from scratch\n\nNNI supports all of the above pruning practices by working on the key pruning stage.\nFollowing this tutorial for a quick look at how to use NNI to prune a model in a common practice.\n"
"\n# Pruning Quickstart\n\nHere is a three-minute video to get you started with model pruning.\n\n.. youtube:: wKh51Jnr0a8\n :align: center\n\nModel pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.\nThere are three common practices for pruning a DNN model:\n\n#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model\n#. Pruning a model during training (i.e., pruning aware training) -> Fine-tuning the pruned model\n#. Pruning a model -> Training the pruned model from scratch\n\nNNI supports all of the above pruning practices by working on the key pruning stage.\nFollowing this tutorial for a quick look at how to use NNI to prune a model in a common practice.\n"
]
},
{
......@@ -165,7 +165,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.8.8"
}
},
"nbformat": 4,
......
......@@ -2,6 +2,11 @@
Pruning Quickstart
==================
Here is a three-minute video to get you started with model pruning.
.. youtube:: wKh51Jnr0a8
   :align: center
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
......
930f8ee2f57b70037e3231152a72606c
\ No newline at end of file
33781311d6344b4aebb94db94a96dfd3
\ No newline at end of file
......@@ -21,6 +21,11 @@
Pruning Quickstart
==================
Here is a three-minute video to get you started with model pruning.
.. youtube:: wKh51Jnr0a8
   :align: center
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
......@@ -31,7 +36,7 @@ There are three common practices for pruning a DNN model:
NNI supports all of the above pruning practices by working on the key pruning stage.
Follow this tutorial for a quick look at how to use NNI to prune a model in a common practice.
.. GENERATED FROM PYTHON SOURCE LINES 17-22
.. GENERATED FROM PYTHON SOURCE LINES 22-27
Preparation
-----------
......@@ -39,7 +44,7 @@ Preparation
In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
If you are familiar with defining a model and training in pytorch, you can skip directly to `Pruning Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 22-35
.. GENERATED FROM PYTHON SOURCE LINES 27-40
.. code-block:: default
......@@ -83,7 +88,7 @@ If you are familiar with defining a model and training in pytorch, you can skip
.. GENERATED FROM PYTHON SOURCE LINES 36-47
.. GENERATED FROM PYTHON SOURCE LINES 41-52
.. code-block:: default
......@@ -108,14 +113,14 @@ If you are familiar with defining a model and training in pytorch, you can skip
.. code-block:: none
Average test loss: 0.5368, Accuracy: 8321/10000 (83%)
Average test loss: 0.3092, Accuracy: 9104/10000 (91%)
Average test loss: 0.2070, Accuracy: 9380/10000 (94%)
Average test loss: 0.4925, Accuracy: 8414/10000 (84%)
Average test loss: 0.2626, Accuracy: 9214/10000 (92%)
Average test loss: 0.2006, Accuracy: 9369/10000 (94%)
.. GENERATED FROM PYTHON SOURCE LINES 48-58
.. GENERATED FROM PYTHON SOURCE LINES 53-63
Pruning Model
-------------
......@@ -128,7 +133,7 @@ The following `config_list` means all layers whose type is `Linear` or `Conv2d`
except the layer named `fc3`, because `fc3` is `exclude`.
The final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.
.. GENERATED FROM PYTHON SOURCE LINES 58-67
.. GENERATED FROM PYTHON SOURCE LINES 63-72
.. code-block:: default
......@@ -148,11 +153,11 @@ The final sparsity ratio for each layer is 50%. The layer named `fc3` will not b
.. GENERATED FROM PYTHON SOURCE LINES 68-69
.. GENERATED FROM PYTHON SOURCE LINES 73-74
Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 69-76
.. GENERATED FROM PYTHON SOURCE LINES 74-81
.. code-block:: default
......@@ -198,7 +203,7 @@ Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 77-84
.. GENERATED FROM PYTHON SOURCE LINES 82-89
.. code-block:: default
......@@ -227,13 +232,13 @@ Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 85-88
.. GENERATED FROM PYTHON SOURCE LINES 90-93
Speed up the original model with masks; note that `ModelSpeedup` requires an unwrapped model.
The model becomes smaller after speedup,
and reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.
.. GENERATED FROM PYTHON SOURCE LINES 88-97
.. GENERATED FROM PYTHON SOURCE LINES 93-102
.. code-block:: default
......@@ -258,17 +263,17 @@ and reaches a higher sparsity ratio because `ModelSpeedup` will propagate the ma
aten::log_softmax is not Supported! Please report an issue at https://github.com/microsoft/nni. Thanks~
Note: .aten::log_softmax.12 does not have corresponding mask inference object
/home/nishang/anaconda3/envs/MCM/lib/python3.9/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811803361/work/build/aten/src/ATen/core/TensorBody.h:417.)
/home/ningshang/anaconda3/envs/nni-dev/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:417.)
return self._grad
.. GENERATED FROM PYTHON SOURCE LINES 98-99
.. GENERATED FROM PYTHON SOURCE LINES 103-104
The model becomes physically smaller after speedup.
.. GENERATED FROM PYTHON SOURCE LINES 99-101
.. GENERATED FROM PYTHON SOURCE LINES 104-106
.. code-block:: default
......@@ -301,14 +306,14 @@ the model will become real smaller after speedup
.. GENERATED FROM PYTHON SOURCE LINES 102-106
.. GENERATED FROM PYTHON SOURCE LINES 107-111
Fine-tuning Compacted Model
---------------------------
Note that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning.
Because speedup will replace the masked big layers with dense small ones.
.. GENERATED FROM PYTHON SOURCE LINES 106-110
.. GENERATED FROM PYTHON SOURCE LINES 111-115
.. code-block:: default
......@@ -326,7 +331,7 @@ Because speedup will replace the masked big layers with dense small ones.
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 58.337 seconds)
**Total running time of the script:** ( 1 minutes 30.730 seconds)
.. _sphx_glr_download_tutorials_pruning_quick_start_mnist.py:
......
.. 5f266ace988c9ca9e44555fdc497e9ba
.. b743ab67f64dd0a0688a8cb184e0e947
.. note::
:class: sphx-glr-download-link-note
......@@ -14,6 +14,11 @@
Pruning Quickstart
==================
Here is a three-minute video to get you started with model pruning.
.. youtube:: wKh51Jnr0a8
   :align: center
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
......
......@@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Quantization Quickstart\n\nQuantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.\n\nIn NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.\nHere we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.\n"
"\n# Quantization Quickstart\n\nHere is a four-minute video to get you started with model quantization.\n\n.. youtube:: MSfV7AyfiA4\n :align: center\n\nQuantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.\n\nIn NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.\nHere we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.\n"
]
},
{
......@@ -143,7 +143,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
"version": "3.8.8"
}
},
"nbformat": 4,
......
......@@ -2,6 +2,11 @@
Quantization Quickstart
=======================
Here is a four-minute video to get you started with model quantization.
.. youtube:: MSfV7AyfiA4
   :align: center
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
......
bceaf8235b437428267b614af06634a0
\ No newline at end of file
2995cef94c5c6c66a6dfa4b5ff28baea
\ No newline at end of file
......@@ -21,12 +21,17 @@
Quantization Quickstart
=======================
Here is a four-minute video to get you started with model quantization.
.. youtube:: MSfV7AyfiA4
   :align: center
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
Here we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.
.. GENERATED FROM PYTHON SOURCE LINES 12-17
.. GENERATED FROM PYTHON SOURCE LINES 17-22
Preparation
-----------
......@@ -34,7 +39,7 @@ Preparation
In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
If you are familiar with defining a model and training in pytorch, you can skip directly to `Quantizing Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 17-37
.. GENERATED FROM PYTHON SOURCE LINES 22-42
.. code-block:: default
......@@ -68,14 +73,14 @@ If you are familiar with defining a model and training in pytorch, you can skip
.. code-block:: none
Average test loss: 0.7073, Accuracy: 7624/10000 (76%)
Average test loss: 0.2776, Accuracy: 9122/10000 (91%)
Average test loss: 0.1907, Accuracy: 9412/10000 (94%)
Average test loss: 0.5901, Accuracy: 8293/10000 (83%)
Average test loss: 0.2469, Accuracy: 9245/10000 (92%)
Average test loss: 0.1586, Accuracy: 9531/10000 (95%)
.. GENERATED FROM PYTHON SOURCE LINES 38-43
.. GENERATED FROM PYTHON SOURCE LINES 43-48
Quantizing Model
----------------
......@@ -83,7 +88,7 @@ Quantizing Model
Initialize a `config_list`.
For details about how to write a ``config_list``, please refer to :doc:`compression config specification <../compression/compression_config_list>`.
.. GENERATED FROM PYTHON SOURCE LINES 43-58
.. GENERATED FROM PYTHON SOURCE LINES 48-63
.. code-block:: default
......@@ -109,11 +114,11 @@ Detailed about how to write ``config_list`` please refer :doc:`compression confi
.. GENERATED FROM PYTHON SOURCE LINES 59-60
.. GENERATED FROM PYTHON SOURCE LINES 64-65
finetuning the model by using QAT
.. GENERATED FROM PYTHON SOURCE LINES 60-65
.. GENERATED FROM PYTHON SOURCE LINES 65-70
.. code-block:: default
......@@ -165,13 +170,13 @@ finetuning the model by using QAT
.. GENERATED FROM PYTHON SOURCE LINES 66-69
.. GENERATED FROM PYTHON SOURCE LINES 71-74
The model has now been wrapped, and quantization targets ('quant_types' setting in `config_list`)
will be quantized & dequantized for simulated quantization in the wrapped layers.
QAT is a training-aware quantizer; it updates the scale and zero point during training.
.. GENERATED FROM PYTHON SOURCE LINES 69-74
.. GENERATED FROM PYTHON SOURCE LINES 74-79
.. code-block:: default
......@@ -190,18 +195,18 @@ QAT is a training-aware quantizer, it will update scale and zero point during tr
.. code-block:: none
Average test loss: 0.1542, Accuracy: 9529/10000 (95%)
Average test loss: 0.1133, Accuracy: 9664/10000 (97%)
Average test loss: 0.0919, Accuracy: 9726/10000 (97%)
Average test loss: 0.1333, Accuracy: 9587/10000 (96%)
Average test loss: 0.1076, Accuracy: 9660/10000 (97%)
Average test loss: 0.0957, Accuracy: 9702/10000 (97%)
.. GENERATED FROM PYTHON SOURCE LINES 75-76
.. GENERATED FROM PYTHON SOURCE LINES 80-81
export model and get calibration_config
.. GENERATED FROM PYTHON SOURCE LINES 76-82
.. GENERATED FROM PYTHON SOURCE LINES 81-87
.. code-block:: default
......@@ -221,16 +226,18 @@ export model and get calibration_config
.. code-block:: none
calibration_config: {'conv1': {'weight_bits': 8, 'weight_scale': tensor([0.0031], device='cuda:0'), 'weight_zero_point': tensor([76.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': -0.4242129623889923, 'tracked_max_input': 2.821486711502075}, 'conv2': {'weight_bits': 8, 'weight_scale': tensor([0.0018], device='cuda:0'), 'weight_zero_point': tensor([113.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 12.42452621459961}, 'fc1': {'weight_bits': 8, 'weight_scale': tensor([0.0011], device='cuda:0'), 'weight_zero_point': tensor([124.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 31.650196075439453}, 'fc2': {'weight_bits': 8, 'weight_scale': tensor([0.0013], device='cuda:0'), 'weight_zero_point': tensor([122.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 25.805370330810547}, 'relu1': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 12.499907493591309}, 'relu2': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 32.0243034362793}, 'relu3': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 26.491384506225586}, 'relu4': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 17.662996292114258}}
INFO:nni.compression.pytorch.compressor:Model state_dict saved to ./log/mnist_model.pth
INFO:nni.compression.pytorch.compressor:Mask dict saved to ./log/mnist_calibration.pth
calibration_config: {'conv1': {'weight_bits': 8, 'weight_scale': tensor([0.0029], device='cuda:0'), 'weight_zero_point': tensor([96.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': -0.4242129623889923, 'tracked_max_input': 2.821486711502075}, 'conv2': {'weight_bits': 8, 'weight_scale': tensor([0.0017], device='cuda:0'), 'weight_zero_point': tensor([101.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 10.014460563659668}, 'fc1': {'weight_bits': 8, 'weight_scale': tensor([0.0012], device='cuda:0'), 'weight_zero_point': tensor([118.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 25.994585037231445}, 'fc2': {'weight_bits': 8, 'weight_scale': tensor([0.0012], device='cuda:0'), 'weight_zero_point': tensor([120.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 21.589195251464844}, 'relu1': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 10.066218376159668}, 'relu2': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 26.317869186401367}, 'relu3': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 21.97711944580078}, 'relu4': {'output_bits': 8, 'tracked_min_output': 0.0, 'tracked_max_output': 17.56885528564453}}
.. GENERATED FROM PYTHON SOURCE LINES 83-84
.. GENERATED FROM PYTHON SOURCE LINES 88-89
Build a TensorRT engine to make a real speedup. For more information about speedup, please refer to :doc:`quantization_speedup`.
.. GENERATED FROM PYTHON SOURCE LINES 84-90
.. GENERATED FROM PYTHON SOURCE LINES 89-95
.. code-block:: default
......@@ -250,8 +257,8 @@ build tensorRT engine to make a real speedup, for more information about speedup
.. code-block:: none
Loss: 0.09358334274291992 Accuracy: 97.21%
Inference elapsed_time (whole dataset): 0.04445981979370117s
Loss: 0.09545102081298829 Accuracy: 96.98%
Inference elapsed_time (whole dataset): 0.03549933433532715s
......@@ -259,7 +266,7 @@ build tensorRT engine to make a real speedup, for more information about speedup
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 1 minutes 36.499 seconds)
**Total running time of the script:** ( 1 minutes 45.743 seconds)
.. _sphx_glr_download_tutorials_quantization_quick_start_mnist.py:
......
......@@ -5,10 +5,10 @@
Computation times
=================
**01:04.509** total execution time for **tutorials** files:
**01:45.743** total execution time for **tutorials** files:
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_speedup.py` (``quantization_speedup.py``) | 01:04.509 | 0.0 MB |
| :ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py` (``quantization_quick_start_mnist.py``) | 01:45.743 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_hello_nas.py` (``hello_nas.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
......@@ -22,5 +22,5 @@ Computation times
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_customize.py` (``quantization_customize.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py` (``quantization_quick_start_mnist.py``) | 00:00.000 | 0.0 MB |
| :ref:`sphx_glr_tutorials_quantization_speedup.py` (``quantization_speedup.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
......@@ -2,6 +2,11 @@
Pruning Quickstart
==================
Here is a three-minute video to get you started with model pruning.
.. youtube:: wKh51Jnr0a8
   :align: center
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
......
......@@ -2,6 +2,11 @@
Quantization Quickstart
=======================
Here is a four-minute video to get you started with model quantization.
.. youtube:: MSfV7AyfiA4
   :align: center
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
......