Unverified Commit a911b856 authored by Yuge Zhang's avatar Yuge Zhang Committed by GitHub

Resolve conflicts for #4760 (#4762)

parent 14d2966b
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Customize Basic Pruner\n\nUsers can easily customize a basic pruner in NNI. A large number of basic modules have been provided and can be reused.\nFollow the NNI pruning interface, users only need to focus on their creative parts without worrying about other regular modules.\n\nIn this tutorial, we show how to customize a basic pruner.\n\n## Concepts\n\nNNI abstracts the basic pruning process into three steps, collecting data, calculating metrics, allocating sparsity.\nMost pruning algorithms rely on a metric to decide where should be pruned. Using L1 norm pruner as an example,\nthe first step is collecting model weights, the second step is calculating L1 norm for weight per output channel,\nthe third step is ranking L1 norm metric and masking the output channels that have small L1 norm.\n\nIn NNI basic pruner, these three step is implement as ``DataCollector``, ``MetricsCalculator`` and ``SparsityAllocator``.\n\n- ``DataCollector``: This module take pruner as initialize parameter.\n It will get the relevant information of the model from the pruner,\n and sometimes it will also hook the model to get input, output or gradient of a layer or a tensor.\n It can also patch optimizer if some special steps need to be executed before or after ``optimizer.step()``.\n\n- ``MetricsCalculator``: This module will take the data collected from the ``DataCollector``,\n then calculate the metrics. The metric shape is usually reduced from the data shape.\n The ``dim`` taken by ``MetricsCalculator`` means which dimension will be kept after calculate metrics.\n i.e., the collected data shape is (10, 20, 30), and the ``dim`` is 1, then the dimension-1 will be kept,\n the output metrics shape should be (20,).\n\n- ``SparsityAllocator``: This module take the metrics and generate the masks.\n Different ``SparsityAllocator`` has different masks generation strategies.\n A common and simple strategy is sorting the metrics' values and calculating a threshold according to the configured sparsity,\n mask the positions which metric value smaller than the threshold.\n The ``dim`` taken by ``SparsityAllocator`` means the metrics are for which dimension, the mask will be expanded to weight shape.\n i.e., the metric shape is (20,), the corresponding layer weight shape is (20, 40), and the ``dim`` is 0.\n ``SparsityAllocator`` will first generate a mask with shape (20,), then expand this mask to shape (20, 40).\n\n## Simple Example: Customize a Block-L1NormPruner\n\nNNI already have L1NormPruner, but for the reason of reproducing the paper and reducing user configuration items,\nit only support pruning layer output channels. In this example, we will customize a pruner that supports block granularity for Linear.\n\nNote that you don't need to implement all these three kinds of tools for each time,\nNNI supports many predefined tools, and you can directly use these to customize your own pruner.\nThis is a tutorial so we show how to define all these three kinds of pruning tools.\n\nCustomize the pruning tools used by the pruner at first.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nfrom nni.algorithms.compression.v2.pytorch.pruning.basic_pruner import BasicPruner\nfrom nni.algorithms.compression.v2.pytorch.pruning.tools import (\n DataCollector,\n MetricsCalculator,\n SparsityAllocator\n)\n\n\n# This data collector collects weight in wrapped module as data.\n# The wrapped module is the module configured in pruner's config_list.\n# This implementation is similar as nni.algorithms.compression.v2.pytorch.pruning.tools.WeightDataCollector\nclass WeightDataCollector(DataCollector):\n def collect(self):\n data = {}\n # get_modules_wrapper will get all the wrapper in the compressor (pruner),\n # it returns a dict with format {wrapper_name: wrapper},\n # use wrapper.module to get the wrapped module.\n for _, wrapper in self.compressor.get_modules_wrapper().items():\n data[wrapper.name] = wrapper.module.weight.data\n # return {wrapper_name: weight_data}\n return data\n\n\nclass BlockNormMetricsCalculator(MetricsCalculator):\n def __init__(self, block_sparse_size):\n # Because we will keep all dimension with block granularity, so fix ``dim=None``,\n # means all dimensions will be kept.\n super().__init__(dim=None, block_sparse_size=block_sparse_size)\n\n def calculate_metrics(self, data):\n data_length = len(self.block_sparse_size)\n reduce_unfold_dims = list(range(data_length, 2 * data_length))\n\n metrics = {}\n for name, t in data.items():\n # Unfold t as block size, and calculate L1 Norm for each block.\n for dim, size in enumerate(self.block_sparse_size):\n t = t.unfold(dim, size, size)\n metrics[name] = t.norm(dim=reduce_unfold_dims, p=1)\n # return {wrapper_name: block_metric}\n return metrics\n\n\n# This implementation is similar as nni.algorithms.compression.v2.pytorch.pruning.tools.NormalSparsityAllocator\nclass BlockSparsityAllocator(SparsityAllocator):\n def __init__(self, pruner, block_sparse_size):\n super().__init__(pruner, dim=None, block_sparse_size=block_sparse_size, continuous_mask=True)\n\n def generate_sparsity(self, metrics):\n masks = {}\n for name, wrapper in self.pruner.get_modules_wrapper().items():\n # wrapper.config['total_sparsity'] can get the configured sparsity ratio for this wrapped module\n sparsity_rate = wrapper.config['total_sparsity']\n # get metric for this wrapped module\n metric = metrics[name]\n # mask the metric with old mask, if the masked position need never recover,\n # just keep this is ok if you are new in NNI pruning\n if self.continuous_mask:\n metric *= self._compress_mask(wrapper.weight_mask)\n # convert sparsity ratio to prune number\n prune_num = int(sparsity_rate * metric.numel())\n # calculate the metric threshold\n threshold = torch.topk(metric.view(-1), prune_num, largest=False)[0].max()\n # generate mask, keep the metric positions that metric values greater than the threshold\n mask = torch.gt(metric, threshold).type_as(metric)\n # expand the mask to weight size, if the block is masked, this block will be filled with zeros,\n # otherwise filled with ones\n masks[name] = self._expand_mask(name, mask)\n # merge the new mask with old mask, if the masked position need never recover,\n # just keep this is ok if you are new in NNI pruning\n if self.continuous_mask:\n masks[name]['weight'] *= wrapper.weight_mask\n return masks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Customize the pruner.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class BlockL1NormPruner(BasicPruner):\n def __init__(self, model, config_list, block_sparse_size):\n self.block_sparse_size = block_sparse_size\n super().__init__(model, config_list)\n\n # Implement reset_tools is enough for this pruner.\n def reset_tools(self):\n if self.data_collector is None:\n self.data_collector = WeightDataCollector(self)\n else:\n self.data_collector.reset()\n if self.metrics_calculator is None:\n self.metrics_calculator = BlockNormMetricsCalculator(self.block_sparse_size)\n if self.sparsity_allocator is None:\n self.sparsity_allocator = BlockSparsityAllocator(self, self.block_sparse_size)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try this pruner.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Define a simple model.\nclass TestModel(torch.nn.Module):\n def __init__(self) -> None:\n super().__init__()\n self.fc1 = torch.nn.Linear(4, 8)\n self.fc2 = torch.nn.Linear(8, 4)\n\n def forward(self, x):\n return self.fc2(self.fc1(x))\n\nmodel = TestModel()\nconfig_list = [{'op_types': ['Linear'], 'total_sparsity': 0.5}]\n# use 2x2 block\n_, masks = BlockL1NormPruner(model, config_list, [2, 2]).compress()\n\n# show the generated masks\nprint('fc1 masks:\\n', masks['fc1']['weight'])\nprint('fc2 masks:\\n', masks['fc2']['weight'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time we successfully define a new pruner with pruning block granularity!\nNote that we don't put validation logic in this example, like ``_validate_config_before_canonical``,\nbut for a robust implementation, we suggest you involve the validation logic.\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
"""
Customize Basic Pruner
======================
Users can easily customize a basic pruner in NNI. A large number of basic modules are provided and can be reused.
Following the NNI pruning interface, users only need to focus on the creative parts without worrying about the routine modules.

In this tutorial, we show how to customize a basic pruner.

Concepts
--------

NNI abstracts the basic pruning process into three steps: collecting data, calculating metrics, and allocating sparsity.
Most pruning algorithms rely on a metric to decide which parts should be pruned. Taking the L1 norm pruner as an example,
the first step collects the model weights, the second step calculates the L1 norm of the weight per output channel,
and the third step ranks the L1 norm metrics and masks the output channels with small L1 norms.

In the NNI basic pruner, these three steps are implemented as ``DataCollector``, ``MetricsCalculator`` and ``SparsityAllocator``.

- ``DataCollector``: This module takes the pruner as an initialization parameter.
  It gets the relevant information of the model from the pruner,
  and sometimes it also hooks the model to get the input, output or gradient of a layer or a tensor.
  It can also patch the optimizer if some special steps need to be executed before or after ``optimizer.step()``.

- ``MetricsCalculator``: This module takes the data collected by the ``DataCollector``
  and calculates the metrics. The metric shape is usually reduced from the data shape.
  The ``dim`` taken by ``MetricsCalculator`` specifies which dimension is kept after calculating the metrics.
  E.g., if the collected data shape is (10, 20, 30) and ``dim`` is 1, then dimension 1 is kept
  and the output metric shape is (20,).

- ``SparsityAllocator``: This module takes the metrics and generates the masks.
  Different ``SparsityAllocator`` implementations have different mask generation strategies.
  A common and simple strategy is to sort the metric values, calculate a threshold according to the configured sparsity,
  and mask the positions whose metric values are smaller than the threshold.
  The ``dim`` taken by ``SparsityAllocator`` specifies which dimension the metrics describe; the mask is expanded to the weight shape.
  E.g., if the metric shape is (20,), the corresponding layer weight shape is (20, 40), and ``dim`` is 0,
  then ``SparsityAllocator`` first generates a mask with shape (20,) and then expands it to shape (20, 40).

Simple Example: Customize a Block-L1NormPruner
----------------------------------------------

NNI already has an L1NormPruner, but to reproduce the paper and reduce the number of user configuration items,
it only supports pruning output channels. In this example, we customize a pruner that supports block granularity for Linear layers.

Note that you don't need to implement all three kinds of tools every time;
NNI provides many predefined tools, and you can use them directly to customize your own pruner.
Because this is a tutorial, we show how to define all three kinds of pruning tools.

Customize the pruning tools used by the pruner first.
"""
import torch
from nni.algorithms.compression.v2.pytorch.pruning.basic_pruner import BasicPruner
from nni.algorithms.compression.v2.pytorch.pruning.tools import (
DataCollector,
MetricsCalculator,
SparsityAllocator
)
# This data collector collects the weights of the wrapped modules as data.
# A wrapped module is a module configured in the pruner's config_list.
# This implementation is similar to nni.algorithms.compression.v2.pytorch.pruning.tools.WeightDataCollector
class WeightDataCollector(DataCollector):
def collect(self):
data = {}
        # get_modules_wrapper gets all the wrappers in the compressor (pruner);
        # it returns a dict in the format {wrapper_name: wrapper}.
        # Use wrapper.module to get the wrapped module.
for _, wrapper in self.compressor.get_modules_wrapper().items():
data[wrapper.name] = wrapper.module.weight.data
# return {wrapper_name: weight_data}
return data
class BlockNormMetricsCalculator(MetricsCalculator):
def __init__(self, block_sparse_size):
        # Because we keep all dimensions with block granularity, we fix ``dim=None``,
        # which means all dimensions are kept.
super().__init__(dim=None, block_sparse_size=block_sparse_size)
def calculate_metrics(self, data):
data_length = len(self.block_sparse_size)
reduce_unfold_dims = list(range(data_length, 2 * data_length))
metrics = {}
for name, t in data.items():
# Unfold t as block size, and calculate L1 Norm for each block.
for dim, size in enumerate(self.block_sparse_size):
t = t.unfold(dim, size, size)
metrics[name] = t.norm(dim=reduce_unfold_dims, p=1)
# return {wrapper_name: block_metric}
return metrics
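# %%
# A quick standalone check (illustrative only, on a made-up 8x4 tensor) of how the unfold-based
# block metric above behaves: with ``block_sparse_size=[2, 2]``, an (8, 4) weight is grouped into
# 4x2 blocks of size 2x2, and the L1 norm over the two unfolded dimensions gives a (4, 2) metric.

_t = torch.arange(32.).reshape(8, 4)
for _dim, _size in enumerate([2, 2]):
    _t = _t.unfold(_dim, _size, _size)
print(_t.shape)                        # torch.Size([4, 2, 2, 2])
print(_t.norm(dim=[2, 3], p=1).shape)  # torch.Size([4, 2])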
# This implementation is similar to nni.algorithms.compression.v2.pytorch.pruning.tools.NormalSparsityAllocator
class BlockSparsityAllocator(SparsityAllocator):
def __init__(self, pruner, block_sparse_size):
super().__init__(pruner, dim=None, block_sparse_size=block_sparse_size, continuous_mask=True)
def generate_sparsity(self, metrics):
masks = {}
for name, wrapper in self.pruner.get_modules_wrapper().items():
# wrapper.config['total_sparsity'] can get the configured sparsity ratio for this wrapped module
sparsity_rate = wrapper.config['total_sparsity']
# get metric for this wrapped module
metric = metrics[name]
            # mask the metric with the old mask so that positions pruned before can never recover;
            # simply keeping this behavior is fine if you are new to NNI pruning
if self.continuous_mask:
metric *= self._compress_mask(wrapper.weight_mask)
# convert sparsity ratio to prune number
prune_num = int(sparsity_rate * metric.numel())
# calculate the metric threshold
threshold = torch.topk(metric.view(-1), prune_num, largest=False)[0].max()
            # generate the mask, keeping the positions whose metric values are greater than the threshold
mask = torch.gt(metric, threshold).type_as(metric)
            # expand the mask to the weight size; if a block is masked, it is filled with zeros,
            # otherwise it is filled with ones
masks[name] = self._expand_mask(name, mask)
            # merge the new mask with the old mask so that positions pruned before can never recover;
            # simply keeping this behavior is fine if you are new to NNI pruning
if self.continuous_mask:
masks[name]['weight'] *= wrapper.weight_mask
return masks
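# %%
# Another small aside (made-up numbers) showing the thresholding step used in ``generate_sparsity``
# above: with 50% sparsity, the half of the blocks with the smallest metric values is masked out.

_metric = torch.tensor([[4., 1.], [3., 2.]])
_prune_num = int(0.5 * _metric.numel())
_threshold = torch.topk(_metric.view(-1), _prune_num, largest=False)[0].max()
print(torch.gt(_metric, _threshold).type_as(_metric))  # tensor([[1., 0.], [1., 0.]])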
# %%
# Customize the pruner.
class BlockL1NormPruner(BasicPruner):
def __init__(self, model, config_list, block_sparse_size):
self.block_sparse_size = block_sparse_size
super().__init__(model, config_list)
    # Implementing reset_tools is enough for this pruner.
def reset_tools(self):
if self.data_collector is None:
self.data_collector = WeightDataCollector(self)
else:
self.data_collector.reset()
if self.metrics_calculator is None:
self.metrics_calculator = BlockNormMetricsCalculator(self.block_sparse_size)
if self.sparsity_allocator is None:
self.sparsity_allocator = BlockSparsityAllocator(self, self.block_sparse_size)
# %%
# Try this pruner.
# Define a simple model.
class TestModel(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
self.fc1 = torch.nn.Linear(4, 8)
self.fc2 = torch.nn.Linear(8, 4)
def forward(self, x):
return self.fc2(self.fc1(x))
model = TestModel()
config_list = [{'op_types': ['Linear'], 'total_sparsity': 0.5}]
# use 2x2 block
_, masks = BlockL1NormPruner(model, config_list, [2, 2]).compress()
# show the generated masks
print('fc1 masks:\n', masks['fc1']['weight'])
print('fc2 masks:\n', masks['fc2']['weight'])
# %%
# This time we have successfully defined a new pruner with block pruning granularity!
# Note that we don't include validation logic in this example, such as ``_validate_config_before_canonical``,
# but for a robust implementation, we suggest you include the validation logic.
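# %%
# For reference, a validation hook could look roughly like the sketch below. This is a hedged
# sketch: it assumes the ``_validate_config_before_canonical(self, model, config_list)`` hook
# mentioned above is available on ``BasicPruner``, and it uses plain Python checks instead of
# NNI's schema utilities.

class ValidatedBlockL1NormPruner(BlockL1NormPruner):
    def _validate_config_before_canonical(self, model, config_list):
        for config in config_list:
            # every config entry should select some modules and define a legal sparsity
            assert 'op_types' in config or 'op_names' in config, 'op_types or op_names is required'
            assert 0.0 < config.get('total_sparsity', 0.0) < 1.0, 'total_sparsity should be in (0, 1)'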
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/pruning_customize.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_pruning_customize.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_pruning_customize.py:
Customize Basic Pruner
======================
Users can easily customize a basic pruner in NNI. A large number of basic modules are provided and can be reused.
Following the NNI pruning interface, users only need to focus on the creative parts without worrying about the routine modules.

In this tutorial, we show how to customize a basic pruner.

Concepts
--------

NNI abstracts the basic pruning process into three steps: collecting data, calculating metrics, and allocating sparsity.
Most pruning algorithms rely on a metric to decide which parts should be pruned. Taking the L1 norm pruner as an example,
the first step collects the model weights, the second step calculates the L1 norm of the weight per output channel,
and the third step ranks the L1 norm metrics and masks the output channels with small L1 norms.

In the NNI basic pruner, these three steps are implemented as ``DataCollector``, ``MetricsCalculator`` and ``SparsityAllocator``.

- ``DataCollector``: This module takes the pruner as an initialization parameter.
  It gets the relevant information of the model from the pruner,
  and sometimes it also hooks the model to get the input, output or gradient of a layer or a tensor.
  It can also patch the optimizer if some special steps need to be executed before or after ``optimizer.step()``.

- ``MetricsCalculator``: This module takes the data collected by the ``DataCollector``
  and calculates the metrics. The metric shape is usually reduced from the data shape.
  The ``dim`` taken by ``MetricsCalculator`` specifies which dimension is kept after calculating the metrics.
  E.g., if the collected data shape is (10, 20, 30) and ``dim`` is 1, then dimension 1 is kept
  and the output metric shape is (20,).

- ``SparsityAllocator``: This module takes the metrics and generates the masks.
  Different ``SparsityAllocator`` implementations have different mask generation strategies.
  A common and simple strategy is to sort the metric values, calculate a threshold according to the configured sparsity,
  and mask the positions whose metric values are smaller than the threshold.
  The ``dim`` taken by ``SparsityAllocator`` specifies which dimension the metrics describe; the mask is expanded to the weight shape.
  E.g., if the metric shape is (20,), the corresponding layer weight shape is (20, 40), and ``dim`` is 0,
  then ``SparsityAllocator`` first generates a mask with shape (20,) and then expands it to shape (20, 40).

Simple Example: Customize a Block-L1NormPruner
----------------------------------------------

NNI already has an L1NormPruner, but to reproduce the paper and reduce the number of user configuration items,
it only supports pruning output channels. In this example, we customize a pruner that supports block granularity for Linear layers.

Note that you don't need to implement all three kinds of tools every time;
NNI provides many predefined tools, and you can use them directly to customize your own pruner.
Because this is a tutorial, we show how to define all three kinds of pruning tools.

Customize the pruning tools used by the pruner first.
.. GENERATED FROM PYTHON SOURCE LINES 51-128
.. code-block:: default
import torch
from nni.algorithms.compression.v2.pytorch.pruning.basic_pruner import BasicPruner
from nni.algorithms.compression.v2.pytorch.pruning.tools import (
DataCollector,
MetricsCalculator,
SparsityAllocator
)
# This data collector collects weight in wrapped module as data.
# The wrapped module is the module configured in pruner's config_list.
# This implementation is similar as nni.algorithms.compression.v2.pytorch.pruning.tools.WeightDataCollector
class WeightDataCollector(DataCollector):
def collect(self):
data = {}
# get_modules_wrapper will get all the wrapper in the compressor (pruner),
# it returns a dict with format {wrapper_name: wrapper},
# use wrapper.module to get the wrapped module.
for _, wrapper in self.compressor.get_modules_wrapper().items():
data[wrapper.name] = wrapper.module.weight.data
# return {wrapper_name: weight_data}
return data
class BlockNormMetricsCalculator(MetricsCalculator):
def __init__(self, block_sparse_size):
# Because we will keep all dimension with block granularity, so fix ``dim=None``,
# means all dimensions will be kept.
super().__init__(dim=None, block_sparse_size=block_sparse_size)
def calculate_metrics(self, data):
data_length = len(self.block_sparse_size)
reduce_unfold_dims = list(range(data_length, 2 * data_length))
metrics = {}
for name, t in data.items():
# Unfold t as block size, and calculate L1 Norm for each block.
for dim, size in enumerate(self.block_sparse_size):
t = t.unfold(dim, size, size)
metrics[name] = t.norm(dim=reduce_unfold_dims, p=1)
# return {wrapper_name: block_metric}
return metrics
# This implementation is similar as nni.algorithms.compression.v2.pytorch.pruning.tools.NormalSparsityAllocator
class BlockSparsityAllocator(SparsityAllocator):
def __init__(self, pruner, block_sparse_size):
super().__init__(pruner, dim=None, block_sparse_size=block_sparse_size, continuous_mask=True)
def generate_sparsity(self, metrics):
masks = {}
for name, wrapper in self.pruner.get_modules_wrapper().items():
# wrapper.config['total_sparsity'] can get the configured sparsity ratio for this wrapped module
sparsity_rate = wrapper.config['total_sparsity']
# get metric for this wrapped module
metric = metrics[name]
# mask the metric with old mask, if the masked position need never recover,
# just keep this is ok if you are new in NNI pruning
if self.continuous_mask:
metric *= self._compress_mask(wrapper.weight_mask)
# convert sparsity ratio to prune number
prune_num = int(sparsity_rate * metric.numel())
# calculate the metric threshold
threshold = torch.topk(metric.view(-1), prune_num, largest=False)[0].max()
# generate mask, keep the metric positions that metric values greater than the threshold
mask = torch.gt(metric, threshold).type_as(metric)
# expand the mask to weight size, if the block is masked, this block will be filled with zeros,
# otherwise filled with ones
masks[name] = self._expand_mask(name, mask)
# merge the new mask with old mask, if the masked position need never recover,
# just keep this is ok if you are new in NNI pruning
if self.continuous_mask:
masks[name]['weight'] *= wrapper.weight_mask
return masks
.. GENERATED FROM PYTHON SOURCE LINES 129-130
Customize the pruner.
.. GENERATED FROM PYTHON SOURCE LINES 130-148
.. code-block:: default
class BlockL1NormPruner(BasicPruner):
def __init__(self, model, config_list, block_sparse_size):
self.block_sparse_size = block_sparse_size
super().__init__(model, config_list)
# Implement reset_tools is enough for this pruner.
def reset_tools(self):
if self.data_collector is None:
self.data_collector = WeightDataCollector(self)
else:
self.data_collector.reset()
if self.metrics_calculator is None:
self.metrics_calculator = BlockNormMetricsCalculator(self.block_sparse_size)
if self.sparsity_allocator is None:
self.sparsity_allocator = BlockSparsityAllocator(self, self.block_sparse_size)
.. GENERATED FROM PYTHON SOURCE LINES 149-150
Try this pruner.
.. GENERATED FROM PYTHON SOURCE LINES 150-171
.. code-block:: default
# Define a simple model.
class TestModel(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
self.fc1 = torch.nn.Linear(4, 8)
self.fc2 = torch.nn.Linear(8, 4)
def forward(self, x):
return self.fc2(self.fc1(x))
model = TestModel()
config_list = [{'op_types': ['Linear'], 'total_sparsity': 0.5}]
# use 2x2 block
_, masks = BlockL1NormPruner(model, config_list, [2, 2]).compress()
# show the generated masks
print('fc1 masks:\n', masks['fc1']['weight'])
print('fc2 masks:\n', masks['fc2']['weight'])
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
fc1 masks:
tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
fc2 masks:
tensor([[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.],
[0., 0., 0., 0., 1., 1., 1., 1.]])
.. GENERATED FROM PYTHON SOURCE LINES 172-175
This time we have successfully defined a new pruner with block pruning granularity!
Note that we don't include validation logic in this example, such as ``_validate_config_before_canonical``,
but for a robust implementation, we suggest you include the validation logic.
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 1.175 seconds)
.. _sphx_glr_download_tutorials_pruning_customize.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: pruning_customize.py <pruning_customize.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: pruning_customize.ipynb <pruning_customize.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Pruning Quickstart\n\nModel pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.\nThere are three common practices for pruning a DNN model:\n\n#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model\n#. Pruning a model during training (i.e., pruning aware training) -> Fine-tuning the pruned model\n#. Pruning a model -> Training the pruned model from scratch\n\nNNI supports all of the above pruning practices by working on the key pruning stage.\nFollowing this tutorial for a quick look at how to use NNI to prune a model in a common practice.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n\nIn this tutorial, we use a simple model and pre-trained on MNIST dataset.\nIf you are familiar with defining a model and training in pytorch, you can skip directly to `Pruning Model`_.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nimport torch.nn.functional as F\nfrom torch.optim import SGD\n\nfrom scripts.compression_mnist_model import TorchModel, trainer, evaluator, device\n\n# define the model\nmodel = TorchModel().to(device)\n\n# show the model structure, note that pruner will wrap the model layer.\nprint(model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# define the optimizer and criterion for pre-training\n\noptimizer = SGD(model.parameters(), 1e-2)\ncriterion = F.nll_loss\n\n# pre-train and evaluate the model on MNIST dataset\nfor epoch in range(3):\n trainer(model, optimizer, criterion)\n evaluator(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pruning Model\n\nUsing L1NormPruner to prune the model and generate the masks.\nUsually, a pruner requires original model and ``config_list`` as its inputs.\nDetailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.\n\nThe following `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,\nexcept the layer named `fc3`, because `fc3` is `exclude`.\nThe final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"config_list = [{\n 'sparsity_per_layer': 0.5,\n 'op_types': ['Linear', 'Conv2d']\n}, {\n 'exclude': True,\n 'op_names': ['fc3']\n}]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pruners usually require `model` and `config_list` as input arguments.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch.pruning import L1NormPruner\npruner = L1NormPruner(model, config_list)\n\n# show the wrapped model structure, `PrunerModuleWrapper` have wrapped the layers that configured in the config_list.\nprint(model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# compress the model and generate the masks\n_, masks = pruner.compress()\n# show the masks sparsity\nfor name, mask in masks.items():\n print(name, ' sparsity : ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Speedup the original model with masks, note that `ModelSpeedup` requires an unwrapped model.\nThe model becomes smaller after speedup,\nand reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# need to unwrap the model, if the model is wrapped before speedup\npruner._unwrap_model()\n\n# speedup the model, for more information about speedup, please refer :doc:`pruning_speedup`.\nfrom nni.compression.pytorch.speedup import ModelSpeedup\n\nModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"the model will become real smaller after speedup\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fine-tuning Compacted Model\nNote that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning.\nBecause speedup will replace the masked big layers with dense small ones.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"optimizer = SGD(model.parameters(), 1e-2)\nfor epoch in range(3):\n trainer(model, optimizer, criterion)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
"""
Pruning Quickstart
==================
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model
#. Pruning a model during training (i.e., pruning aware training) -> Fine-tuning the pruned model
#. Pruning a model -> Training the pruned model from scratch
NNI supports all of the above pruning practices by working on the key pruning stage.
Follow this tutorial for a quick look at how to use NNI to prune a model in a common practice.
"""
# %%
# Preparation
# -----------
#
# In this tutorial, we use a simple model pre-trained on the MNIST dataset.
# If you are familiar with defining a model and training it in PyTorch, you can skip directly to `Pruning Model`_.
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# show the model structure, note that pruner will wrap the model layer.
print(model)
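# %%
# ``scripts.compression_mnist_model`` ships with the NNI examples. If you do not have it, the
# printed structure above corresponds roughly to the LeNet-style sketch below. This is only an
# approximation inferred from the printed output; the ``trainer`` and ``evaluator`` helpers
# (MNIST data loading plus the training/evaluation loops) are not reproduced here.

class SketchTorchModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        self.conv2 = torch.nn.Conv2d(6, 16, 5)
        self.fc1 = torch.nn.Linear(256, 120)
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)
        self.relu1, self.relu2, self.relu3, self.relu4 = (torch.nn.ReLU() for _ in range(4))
        self.pool1 = torch.nn.MaxPool2d(2, 2)
        self.pool2 = torch.nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = x.flatten(1)
        x = self.relu3(self.fc1(x))
        x = self.relu4(self.fc2(x))
        return F.log_softmax(self.fc3(x), dim=1)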
# %%
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
# %%
# Pruning Model
# -------------
#
# Use L1NormPruner to prune the model and generate the masks.
# Usually, a pruner requires the original model and a ``config_list`` as its inputs.
# For details about how to write a ``config_list``, please refer to :doc:`compression config specification <../compression/compression_config_list>`.
#
# The following `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,
# except the layer named `fc3`, because `fc3` is set to `exclude`.
# The final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
}]
# %%
# Pruners usually require `model` and `config_list` as input arguments.
from nni.compression.pytorch.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure; `PrunerModuleWrapper` has wrapped the layers configured in the config_list.
print(model)
# %%
# compress the model and generate the masks
_, masks = pruner.compress()
# show the masks sparsity
for name, mask in masks.items():
print(name, ' sparsity : ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))
# %%
# Speed up the original model with the masks; note that `ModelSpeedup` requires an unwrapped model.
# The model becomes smaller after speedup,
# and reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.
# need to unwrap the model, if the model is wrapped before speedup
pruner._unwrap_model()
# speed up the model; for more information about speedup, please refer to :doc:`pruning_speedup`.
from nni.compression.pytorch.speedup import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
# %%
# the model becomes really smaller after speedup
print(model)
# %%
# Fine-tuning Compacted Model
# ---------------------------
# Note that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning,
# because speedup will replace the masked large layers with dense small ones.
optimizer = SGD(model.parameters(), 1e-2)
for epoch in range(3):
trainer(model, optimizer, criterion)
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/pruning_quick_start_mnist.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_pruning_quick_start_mnist.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_pruning_quick_start_mnist.py:
Pruning Quickstart
==================
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model
#. Pruning a model during training (i.e., pruning aware training) -> Fine-tuning the pruned model
#. Pruning a model -> Training the pruned model from scratch
NNI supports all of the above pruning practices by working on the key pruning stage.
Follow this tutorial for a quick look at how to use NNI to prune a model in a common practice.
.. GENERATED FROM PYTHON SOURCE LINES 17-22
Preparation
-----------
In this tutorial, we use a simple model pre-trained on the MNIST dataset.
If you are familiar with defining a model and training it in PyTorch, you can skip directly to `Pruning Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 22-35
.. code-block:: default
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# show the model structure, note that pruner will wrap the model layer.
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=256, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
(relu1): ReLU()
(relu2): ReLU()
(relu3): ReLU()
(relu4): ReLU()
(pool1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(pool2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
.. GENERATED FROM PYTHON SOURCE LINES 36-47
.. code-block:: default
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 0.5368, Accuracy: 8321/10000 (83%)
Average test loss: 0.3092, Accuracy: 9104/10000 (91%)
Average test loss: 0.2070, Accuracy: 9380/10000 (94%)
.. GENERATED FROM PYTHON SOURCE LINES 48-58
Pruning Model
-------------
Use L1NormPruner to prune the model and generate the masks.
Usually, a pruner requires the original model and a ``config_list`` as its inputs.
For details about how to write a ``config_list``, please refer to :doc:`compression config specification <../compression/compression_config_list>`.
The following `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,
except the layer named `fc3`, because `fc3` is set to `exclude`.
The final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.
.. GENERATED FROM PYTHON SOURCE LINES 58-67
.. code-block:: default
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
}]
.. GENERATED FROM PYTHON SOURCE LINES 68-69
Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 69-76
.. code-block:: default
from nni.compression.pytorch.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure, `PrunerModuleWrapper` have wrapped the layers that configured in the config_list.
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): PrunerModuleWrapper(
(module): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
)
(conv2): PrunerModuleWrapper(
(module): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
)
(fc1): PrunerModuleWrapper(
(module): Linear(in_features=256, out_features=120, bias=True)
)
(fc2): PrunerModuleWrapper(
(module): Linear(in_features=120, out_features=84, bias=True)
)
(fc3): Linear(in_features=84, out_features=10, bias=True)
(relu1): ReLU()
(relu2): ReLU()
(relu3): ReLU()
(relu4): ReLU()
(pool1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(pool2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
.. GENERATED FROM PYTHON SOURCE LINES 77-84
.. code-block:: default
# compress the model and generate the masks
_, masks = pruner.compress()
# show the masks sparsity
for name, mask in masks.items():
print(name, ' sparsity : ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
conv1 sparsity : 0.5
conv2 sparsity : 0.5
fc1 sparsity : 0.5
fc2 sparsity : 0.5
.. GENERATED FROM PYTHON SOURCE LINES 85-88
Speed up the original model with the masks; note that `ModelSpeedup` requires an unwrapped model.
The model becomes smaller after speedup,
and reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.
.. GENERATED FROM PYTHON SOURCE LINES 88-97
.. code-block:: default
# need to unwrap the model, if the model is wrapped before speedup
pruner._unwrap_model()
# speedup the model, for more information about speedup, please refer :doc:`pruning_speedup`.
from nni.compression.pytorch.speedup import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
aten::log_softmax is not Supported! Please report an issue at https://github.com/microsoft/nni. Thanks~
Note: .aten::log_softmax.12 does not have corresponding mask inference object
/home/nishang/anaconda3/envs/MCM/lib/python3.9/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811803361/work/build/aten/src/ATen/core/TensorBody.h:417.)
return self._grad
.. GENERATED FROM PYTHON SOURCE LINES 98-99
The model becomes really smaller after speedup.
.. GENERATED FROM PYTHON SOURCE LINES 99-101
.. code-block:: default
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 3, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(3, 8, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=128, out_features=60, bias=True)
(fc2): Linear(in_features=60, out_features=42, bias=True)
(fc3): Linear(in_features=42, out_features=10, bias=True)
(relu1): ReLU()
(relu2): ReLU()
(relu3): ReLU()
(relu4): ReLU()
(pool1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(pool2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
.. GENERATED FROM PYTHON SOURCE LINES 102-106
Fine-tuning Compacted Model
---------------------------
Note that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning,
because speedup will replace the masked large layers with dense small ones.
.. GENERATED FROM PYTHON SOURCE LINES 106-110
.. code-block:: default
optimizer = SGD(model.parameters(), 1e-2)
for epoch in range(3):
trainer(model, optimizer, criterion)
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 58.337 seconds)
.. _sphx_glr_download_tutorials_pruning_quick_start_mnist.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: pruning_quick_start_mnist.py <pruning_quick_start_mnist.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: pruning_quick_start_mnist.ipynb <pruning_quick_start_mnist.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_pruning_quick_start_mnist.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_pruning_quick_start_mnist.py:
Pruning Quickstart
==================
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
There are three common practices for pruning a DNN model:
#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model
#. Pruning a model during training (i.e., pruning-aware training) -> Fine-tuning the pruned model
#. Pruning a model -> Training the pruned model from scratch
NNI supports all of the above pruning practices by working on the key pruning stage.
Follow this tutorial for a quick look at how to use NNI to prune a model in a common practice.
.. GENERATED FROM PYTHON SOURCE LINES 17-22
Preparation
-----------
In this tutorial, we use a simple model pre-trained on the MNIST dataset.
If you are familiar with defining and training a model in PyTorch, you can skip directly to `Pruning Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 22-35
.. code-block:: default
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# show the model structure, note that pruner will wrap the model layer.
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=256, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 36-47
.. code-block:: default
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 0.5266, Accuracy: 8345/10000 (83%)
Average test loss: 0.2713, Accuracy: 9209/10000 (92%)
Average test loss: 0.1919, Accuracy: 9356/10000 (94%)
.. GENERATED FROM PYTHON SOURCE LINES 48-58
Pruning Model
-------------
Use L1NormPruner to prune the model and generate the masks.
Usually, a pruner requires the original model and a ``config_list`` as input arguments.
For details about how to write a ``config_list``, please refer to :doc:`compression config specification <../compression/compression_config_list>`.
The following `config_list` means the pruner will prune all layers of type `Linear` or `Conv2d` except the layer named `fc3`, because `fc3` is set to `exclude`.
The final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.
.. GENERATED FROM PYTHON SOURCE LINES 58-67
.. code-block:: default
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
}]
.. GENERATED FROM PYTHON SOURCE LINES 68-69
Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 69-76
.. code-block:: default
from nni.compression.pytorch.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure, `PrunerModuleWrapper` have wrapped the layers that configured in the config_list.
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): PrunerModuleWrapper(
(module): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
)
(conv2): PrunerModuleWrapper(
(module): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
)
(fc1): PrunerModuleWrapper(
(module): Linear(in_features=256, out_features=120, bias=True)
)
(fc2): PrunerModuleWrapper(
(module): Linear(in_features=120, out_features=84, bias=True)
)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 77-84
.. code-block:: default
# compress the model and generate the masks
_, masks = pruner.compress()
# show the masks sparsity
for name, mask in masks.items():
print(name, ' sparsity : ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
conv1 sparsity : 0.5
conv2 sparsity : 0.5
fc1 sparsity : 0.5
fc2 sparsity : 0.5
.. GENERATED FROM PYTHON SOURCE LINES 85-88
Speed up the original model with NNI's model speedup and the masks generated by the pruner; note that `ModelSpeedup` requires an unwrapped model.
The model becomes really smaller after speedup and may reach a higher sparsity ratio than the masks indicate, because `ModelSpeedup` automatically propagates the sparsity through the model
and identifies the redundant weights introduced by the masks.
.. GENERATED FROM PYTHON SOURCE LINES 88-97
.. code-block:: default
# need to unwrap the model, if the model is wrapped before speedup
pruner._unwrap_model()
# speedup the model
from nni.compression.pytorch.speedup import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
aten::log_softmax is not Supported! Please report an issue at https://github.com/microsoft/nni. Thanks~
Note: .aten::log_softmax.12 does not have corresponding mask inference object
/home/ningshang/anaconda3/envs/nni-dev/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:417.)
return self._grad
.. GENERATED FROM PYTHON SOURCE LINES 98-99
The model becomes smaller after speedup.
.. GENERATED FROM PYTHON SOURCE LINES 99-101
.. code-block:: default
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 3, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(3, 8, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=128, out_features=60, bias=True)
(fc2): Linear(in_features=60, out_features=42, bias=True)
(fc3): Linear(in_features=42, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 102-106
Fine-tuning Compacted Model
---------------------------
Note that the model has already been sped up; if you need to fine-tune it, please re-create the optimizer,
because layers were replaced during speedup and the original optimizer no longer applies to the new model.
.. GENERATED FROM PYTHON SOURCE LINES 106-110
.. code-block:: default
optimizer = SGD(model.parameters(), 1e-2)
for epoch in range(3):
trainer(model, optimizer, criterion)
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 1 minutes 24.976 seconds)
.. _sphx_glr_download_tutorials_pruning_quick_start_mnist.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: pruning_quick_start_mnist.py <pruning_quick_start_mnist.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: pruning_quick_start_mnist.ipynb <pruning_quick_start_mnist.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Speedup Model with Mask\n\n## Introduction\n\nPruning algorithms usually use weight masks to simulate the real pruning. Masks can be used\nto check model performance of a specific pruning (or sparsity), but there is no real speedup.\nSince model speedup is the ultimate goal of model pruning, we try to provide a tool to users\nto convert a model to a smaller one based on user provided masks (the masks come from the\npruning algorithms).\n\nThere are two types of pruning. One is fine-grained pruning, it does not change the shape of weights,\nand input/output tensors. Sparse kernel is required to speedup a fine-grained pruned layer.\nThe other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning.\nTo speedup this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.\nSince the support of sparse kernels in community is limited,\nwe only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.\n\n## Design and Implementation\n\nTo speedup a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,\nor replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors,\nthus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change.\nTherefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;\nsecond, replace the modules.\n\nThe first step requires topology (i.e., connections) of the model, we use ``jit.trace`` to obtain the model graph for PyTorch.\nThe new shape of module is auto-inference by NNI, the unchanged parts of outputs during forward and inputs during backward are prepared for reduct.\nFor each type of module, we should prepare a function for module replacement.\nThe module replacement function returns a newly created module which is smaller.\n\n## Usage\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a mask for the model at first.\nWe usually use a NNI pruner to generate the masks then use ``ModelSpeedup`` to compact the model.\nBut in fact ``ModelSpeedup`` is a relatively independent tool, so you can use it independently.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nfrom scripts.compression_mnist_model import TorchModel, device\n\nmodel = TorchModel().to(device)\n# masks = {layer_name: {'weight': weight_mask, 'bias': bias_mask}}\nconv1_mask = torch.ones_like(model.conv1.weight.data)\n# mask the first three output channels in conv1\nconv1_mask[0: 3] = 0\nmasks = {'conv1': {'weight': conv1_mask}}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the original model structure.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Roughly test the original model inference speed.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import time\nstart = time.time()\nmodel(torch.rand(128, 1, 28, 28).to(device))\nprint('Original Model - Elapsed Time : ', time.time() - start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Speedup the model and show the model structure after speedup.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch import ModelSpeedup\nModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()\nprint(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Roughly test the model after speedup inference speed.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"start = time.time()\nmodel(torch.rand(128, 1, 28, 28).to(device))\nprint('Speedup Model - Elapsed Time : ', time.time() - start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For combining usage of ``Pruner`` masks generation with ``ModelSpeedup``,\nplease refer to :doc:`Pruning Quick Start <pruning_quick_start_mnist>`.\n\nNOTE: The current implementation supports PyTorch 1.3.1 or newer.\n\n## Limitations\n\nFor PyTorch we can only replace modules, if functions in ``forward`` should be replaced,\nour current implementation does not work. One workaround is make the function a PyTorch module.\n\nIf you want to speedup your own model which cannot supported by the current implementation,\nyou need implement the replace function for module replacement, welcome to contribute.\n\n## Speedup Results of Examples\n\nThe code of these experiments can be found :githublink:`here <examples/model_compress/pruning/legacy/speedup/model_speedup.py>`.\n\nThese result are tested on the `legacy pruning framework <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_, new results will coming soon.\n\n### slim pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - Times\n - Mask Latency\n - Speedup Latency\n * - 1\n - 0.01197\n - 0.005107\n * - 2\n - 0.02019\n - 0.008769\n * - 4\n - 0.02733\n - 0.014809\n * - 8\n - 0.04310\n - 0.027441\n * - 16\n - 0.07731\n - 0.05008\n * - 32\n - 0.14464\n - 0.10027\n\n### fpgm pruner example\n\non cpu,\ninput tensor: ``torch.randn(64, 1, 28, 28)``\\ ,\ntoo large variance\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - Times\n - Mask Latency\n - Speedup Latency\n * - 1\n - 0.01383\n - 0.01839\n * - 2\n - 0.01167\n - 0.003558\n * - 4\n - 0.01636\n - 0.01088\n * - 40\n - 0.14412\n - 0.08268\n * - 40\n - 1.29385\n - 0.14408\n * - 40\n - 0.41035\n - 0.46162\n * - 400\n - 6.29020\n - 5.82143\n\n### l1filter pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - Times\n - Mask Latency\n - Speedup Latency\n * - 1\n - 0.01026\n - 0.003677\n * - 2\n - 0.01657\n - 0.008161\n * - 4\n - 0.02458\n - 0.020018\n * - 8\n - 0.03498\n - 0.025504\n * - 16\n - 0.06757\n - 0.047523\n * - 32\n - 0.10487\n - 0.086442\n\n### APoZ pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - Times\n - Mask Latency\n - Speedup Latency\n * - 1\n - 0.01389\n - 0.004208\n * - 2\n - 0.01628\n - 0.008310\n * - 4\n - 0.02521\n - 0.014008\n * - 8\n - 0.03386\n - 0.023923\n * - 16\n - 0.06042\n - 0.046183\n * - 32\n - 0.12421\n - 0.087113\n\n### SimulatedAnnealing pruner example\n\nIn this experiment, we use SimulatedAnnealing pruner to prune the resnet18 on the cifar10 dataset.\nWe measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.\nThe latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.\n\n<img src=\"file://../../img/SA_latency_accuracy.png\">\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
"""
Speedup Model with Mask
========================
Introduction
------------
Pruning algorithms usually use weight masks to simulate real pruning. Masks can be used
to check the model performance under a specific pruning (or sparsity) setting, but they bring no real speedup.
Since model speedup is the ultimate goal of model pruning, we provide a tool that converts
a model into a smaller one based on user-provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. Fine-grained pruning does not change the shape of weights
or input/output tensors, so a sparse kernel is required to speed up a fine-grained pruned layer.
Coarse-grained pruning (e.g., channel pruning) usually changes the shape of weights and input/output tensors.
To speed up this kind of pruning, there is no need for a sparse kernel; the pruned layer can simply be replaced with a smaller one.
Since the support for sparse kernels in the community is limited,
we only support the speedup of coarse-grained pruning for now and leave fine-grained pruning for the future.
Design and Implementation
-------------------------
To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask,
or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors,
so shape inference is needed to check whether other, unpruned layers should also be replaced because of the shape change.
Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
second, replace the modules.
The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
The new shape of each module is inferred automatically by NNI: the unchanged parts of the outputs during the forward pass and of the inputs during the backward pass indicate which parts of a module can be reduced.
For each type of module, we should prepare a function for module replacement.
The module replacement function returns a newly created, smaller module. (A hedged sketch of such a function is shown right below.)
Usage
-----
"""
# %%
# Generate a mask for the model at first.
# We usually use an NNI pruner to generate the masks and then use ``ModelSpeedup`` to compact the model
# (a hedged sketch of that usual flow is given in the comments after the mask construction below).
# But ``ModelSpeedup`` is in fact a relatively independent tool, so you can also use it on its own.
import torch
from scripts.compression_mnist_model import TorchModel, device
model = TorchModel().to(device)
# masks = {layer_name: {'weight': weight_mask, 'bias': bias_mask}}
conv1_mask = torch.ones_like(model.conv1.weight.data)
# mask the first three output channels in conv1
conv1_mask[0: 3] = 0
masks = {'conv1': {'weight': conv1_mask}}
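
# As referenced above, a hedged sketch of the usual flow with an NNI pruner is shown
# below as comments (see :doc:`Pruning Quick Start <pruning_quick_start_mnist>` for the
# authoritative version); this tutorial keeps using the hand-made mask above instead.
#
#   from nni.compression.pytorch.pruning import L1NormPruner
#   pruner = L1NormPruner(model, [{'total_sparsity': 0.5, 'op_types': ['Conv2d']}])
#   _, masks = pruner.compress()
#   pruner._unwrap_model()  # unwrap the model before passing the masks to ModelSpeedup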
# %%
# Show the original model structure.
print(model)
# %%
# Roughly test the inference speed of the original model.
import time
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Original Model - Elapsed Time : ', time.time() - start)
# %%
# Speed up the model and show the model structure after speedup.
from nni.compression.pytorch import ModelSpeedup
ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
print(model)
# %%
# Roughly test the inference speed of the model after speedup.
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Speedup Model - Elapsed Time : ', time.time() - start)
# %%
# For the combined usage of ``Pruner`` mask generation and ``ModelSpeedup``,
# please refer to :doc:`Pruning Quick Start <pruning_quick_start_mnist>`.
#
# NOTE: The current implementation supports PyTorch 1.3.1 or newer.
#
# Limitations
# -----------
#
# For PyTorch we can only replace modules; if functions in ``forward`` need to be replaced,
# our current implementation does not work. One workaround is to make the function a PyTorch
# module (a hedged sketch of this workaround is given right after this section).
#
# If you want to speed up your own model which is not supported by the current implementation,
# you need to implement the replacement function for the module; contributions are welcome.
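#
# Below is a hedged sketch of the workaround just mentioned: wrapping a functional call
# from ``forward`` (here ``torch.flatten``, purely as an illustration) in a small
# ``nn.Module`` so that it shows up as a replaceable module in the traced graph.

import torch.nn as nn

class FlattenAsModule(nn.Module):
    def forward(self, x):
        # the original functional call now lives inside a module's forward
        return torch.flatten(x, start_dim=1)

# %%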
#
# Speedup Results of Examples
# ---------------------------
#
# The code of these experiments can be found :githublink:`here <examples/model_compress/pruning/legacy/speedup/model_speedup.py>`.
#
# These results were obtained on the `legacy pruning framework <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_; new results are coming soon.
#
# slim pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01197
# - 0.005107
# * - 2
# - 0.02019
# - 0.008769
# * - 4
# - 0.02733
# - 0.014809
# * - 8
# - 0.04310
# - 0.027441
# * - 16
# - 0.07731
# - 0.05008
# * - 32
# - 0.14464
# - 0.10027
#
# fpgm pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on CPU,
# input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
# note: the measured latency has a large variance
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01383
# - 0.01839
# * - 2
# - 0.01167
# - 0.003558
# * - 4
# - 0.01636
# - 0.01088
# * - 40
# - 0.14412
# - 0.08268
# * - 40
# - 1.29385
# - 0.14408
# * - 40
# - 0.41035
# - 0.46162
# * - 400
# - 6.29020
# - 5.82143
#
# l1filter pruner example
# ^^^^^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01026
# - 0.003677
# * - 2
# - 0.01657
# - 0.008161
# * - 4
# - 0.02458
# - 0.020018
# * - 8
# - 0.03498
# - 0.025504
# * - 16
# - 0.06757
# - 0.047523
# * - 32
# - 0.10487
# - 0.086442
#
# APoZ pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01389
# - 0.004208
# * - 2
# - 0.01628
# - 0.008310
# * - 4
# - 0.02521
# - 0.014008
# * - 8
# - 0.03386
# - 0.023923
# * - 16
# - 0.06042
# - 0.046183
# * - 32
# - 0.12421
# - 0.087113
#
# SimulatedAnnealing pruner example
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# In this experiment, we use the SimulatedAnnealing pruner to prune ResNet-18 on the CIFAR-10 dataset.
# We measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.
# The latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.
#
# .. image:: ../../img/SA_latency_accuracy.png
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/pruning_speedup.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_pruning_speedup.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_pruning_speedup.py:
Speedup Model with Mask
========================
Introduction
------------
Pruning algorithms usually use weight masks to simulate real pruning. Masks can be used
to check the model performance under a specific pruning (or sparsity) setting, but they bring no real speedup.
Since model speedup is the ultimate goal of model pruning, we provide a tool that converts
a model into a smaller one based on user-provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. Fine-grained pruning does not change the shape of weights
or input/output tensors, so a sparse kernel is required to speed up a fine-grained pruned layer.
Coarse-grained pruning (e.g., channel pruning) usually changes the shape of weights and input/output tensors.
To speed up this kind of pruning, there is no need for a sparse kernel; the pruned layer can simply be replaced with a smaller one.
Since the support for sparse kernels in the community is limited,
we only support the speedup of coarse-grained pruning for now and leave fine-grained pruning for the future.
Design and Implementation
-------------------------
To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask,
or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors,
so shape inference is needed to check whether other, unpruned layers should also be replaced because of the shape change.
Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
second, replace the modules.
The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
The new shape of each module is inferred automatically by NNI: the unchanged parts of the outputs during the forward pass and of the inputs during the backward pass indicate which parts of a module can be reduced.
For each type of module, we should prepare a function for module replacement.
The module replacement function returns a newly created, smaller module.
Usage
-----
.. GENERATED FROM PYTHON SOURCE LINES 41-44
Generate a mask for the model at first.
We usually use an NNI pruner to generate the masks and then use ``ModelSpeedup`` to compact the model.
But ``ModelSpeedup`` is in fact a relatively independent tool, so you can also use it on its own.
.. GENERATED FROM PYTHON SOURCE LINES 44-55
.. code-block:: default
import torch
from scripts.compression_mnist_model import TorchModel, device
model = TorchModel().to(device)
# masks = {layer_name: {'weight': weight_mask, 'bias': bias_mask}}
conv1_mask = torch.ones_like(model.conv1.weight.data)
# mask the first three output channels in conv1
conv1_mask[0: 3] = 0
masks = {'conv1': {'weight': conv1_mask}}
.. GENERATED FROM PYTHON SOURCE LINES 56-57
Show the original model structure.
.. GENERATED FROM PYTHON SOURCE LINES 57-59
.. code-block:: default
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=256, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
(relu1): ReLU()
(relu2): ReLU()
(relu3): ReLU()
(relu4): ReLU()
(pool1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(pool2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
.. GENERATED FROM PYTHON SOURCE LINES 60-61
Roughly test the inference speed of the original model.
.. GENERATED FROM PYTHON SOURCE LINES 61-66
.. code-block:: default
import time
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Original Model - Elapsed Time : ', time.time() - start)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Original Model - Elapsed Time : 0.5094916820526123
.. GENERATED FROM PYTHON SOURCE LINES 67-68
Speed up the model and show the model structure after speedup.
.. GENERATED FROM PYTHON SOURCE LINES 68-72
.. code-block:: default
from nni.compression.pytorch import ModelSpeedup
ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
aten::log_softmax is not Supported! Please report an issue at https://github.com/microsoft/nni. Thanks~
Note: .aten::log_softmax.12 does not have corresponding mask inference object
/home/nishang/anaconda3/envs/MCM/lib/python3.9/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811803361/work/build/aten/src/ATen/core/TensorBody.h:417.)
return self._grad
TorchModel(
(conv1): Conv2d(1, 3, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(3, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=256, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
(relu1): ReLU()
(relu2): ReLU()
(relu3): ReLU()
(relu4): ReLU()
(pool1): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(pool2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
)
.. GENERATED FROM PYTHON SOURCE LINES 73-74
Roughly test the inference speed of the model after speedup.
.. GENERATED FROM PYTHON SOURCE LINES 74-78
.. code-block:: default
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Speedup Model - Elapsed Time : ', time.time() - start)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Speedup Model - Elapsed Time : 0.006000041961669922
.. GENERATED FROM PYTHON SOURCE LINES 79-240
For the combined usage of ``Pruner`` mask generation and ``ModelSpeedup``,
please refer to :doc:`Pruning Quick Start <pruning_quick_start_mnist>`.
NOTE: The current implementation supports PyTorch 1.3.1 or newer.
Limitations
-----------
For PyTorch we can only replace modules; if functions in ``forward`` need to be replaced,
our current implementation does not work. One workaround is to make the function a PyTorch module.
If you want to speed up your own model which is not supported by the current implementation,
you need to implement the replacement function for the module; contributions are welcome.
Speedup Results of Examples
---------------------------
The code of these experiments can be found :githublink:`here <examples/model_compress/pruning/legacy/speedup/model_speedup.py>`.
These results were obtained on the `legacy pruning framework <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_; new results are coming soon.
slim pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01197
- 0.005107
* - 2
- 0.02019
- 0.008769
* - 4
- 0.02733
- 0.014809
* - 8
- 0.04310
- 0.027441
* - 16
- 0.07731
- 0.05008
* - 32
- 0.14464
- 0.10027
fpgm pruner example
^^^^^^^^^^^^^^^^^^^
on CPU,
input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
note: the measured latency has a large variance
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01383
- 0.01839
* - 2
- 0.01167
- 0.003558
* - 4
- 0.01636
- 0.01088
* - 40
- 0.14412
- 0.08268
* - 40
- 1.29385
- 0.14408
* - 40
- 0.41035
- 0.46162
* - 400
- 6.29020
- 5.82143
l1filter pruner example
^^^^^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01026
- 0.003677
* - 2
- 0.01657
- 0.008161
* - 4
- 0.02458
- 0.020018
* - 8
- 0.03498
- 0.025504
* - 16
- 0.06757
- 0.047523
* - 32
- 0.10487
- 0.086442
APoZ pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01389
- 0.004208
* - 2
- 0.01628
- 0.008310
* - 4
- 0.02521
- 0.014008
* - 8
- 0.03386
- 0.023923
* - 16
- 0.06042
- 0.046183
* - 32
- 0.12421
- 0.087113
SimulatedAnnealing pruner example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In this experiment, we use the SimulatedAnnealing pruner to prune ResNet-18 on the CIFAR-10 dataset.
We measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.
The latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.
.. image:: ../../img/SA_latency_accuracy.png
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 4.528 seconds)
.. _sphx_glr_download_tutorials_pruning_speedup.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: pruning_speedup.py <pruning_speedup.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: pruning_speedup.ipynb <pruning_speedup.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Customize a new quantization algorithm\n\nTo write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``.\nThen, override the member functions with the logic of your algorithm. The member function to override is ``quantize_weight``.\n``quantize_weight`` directly returns the quantized weights rather than mask, because for quantization the quantized weights cannot be obtained by applying mask.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch import Quantizer\n\nclass YourQuantizer(Quantizer):\n def __init__(self, model, config_list):\n \"\"\"\n Suggest you to use the NNI defined spec for config\n \"\"\"\n super().__init__(model, config_list)\n\n def quantize_weight(self, weight, config, **kwargs):\n \"\"\"\n quantize should overload this method to quantize weight tensors.\n This method is effectively hooked to :meth:`forward` of the model.\n\n Parameters\n ----------\n weight : Tensor\n weight that needs to be quantized\n config : dict\n the configuration for weight quantization\n \"\"\"\n\n # Put your code to generate `new_weight` here\n new_weight = ...\n return new_weight\n\n def quantize_output(self, output, config, **kwargs):\n \"\"\"\n quantize should overload this method to quantize output.\n This method is effectively hooked to `:meth:`forward` of the model.\n\n Parameters\n ----------\n output : Tensor\n output that needs to be quantized\n config : dict\n the configuration for output quantization\n \"\"\"\n\n # Put your code to generate `new_output` here\n new_output = ...\n return new_output\n\n def quantize_input(self, *inputs, config, **kwargs):\n \"\"\"\n quantize should overload this method to quantize input.\n This method is effectively hooked to :meth:`forward` of the model.\n\n Parameters\n ----------\n inputs : Tensor\n inputs that needs to be quantized\n config : dict\n the configuration for inputs quantization\n \"\"\"\n\n # Put your code to generate `new_input` here\n new_input = ...\n return new_input\n\n def update_epoch(self, epoch_num):\n pass\n\n def step(self):\n \"\"\"\n Can do some processing based on the model or weights binded\n in the func bind_model\n \"\"\"\n pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Customize backward function\n\nSometimes it's necessary for a quantization operation to have a customized backward function,\nsuch as `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\\ ,\nuser can customize a backward function as follow:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType\n\nclass ClipGrad(QuantGrad):\n @staticmethod\n def quant_backward(tensor, grad_output, quant_type):\n \"\"\"\n This method should be overrided by subclass to provide customized backward function,\n default implementation is Straight-Through Estimator\n Parameters\n ----------\n tensor : Tensor\n input of quantization operation\n grad_output : Tensor\n gradient of the output of quantization operation\n quant_type : QuantType\n the type of quantization, it can be `QuantType.INPUT`, `QuantType.WEIGHT`, `QuantType.OUTPUT`,\n you can define different behavior for different types.\n Returns\n -------\n tensor\n gradient of the input of quantization operation\n \"\"\"\n\n # for quant_output function, set grad to zero if the absolute value of tensor is larger than 1\n if quant_type == QuantType.OUTPUT:\n grad_output[tensor.abs() > 1] = 0\n return grad_output\n\nclass _YourQuantizer(Quantizer):\n def __init__(self, model, config_list):\n super().__init__(model, config_list)\n # set your customized backward function to overwrite default backward function\n self.quant_grad = ClipGrad"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do not customize ``QuantGrad``, the default backward is Straight-Through Estimator. \n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
"""
Customize a new quantization algorithm
======================================
To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``.
Then, override the member functions with the logic of your algorithm. The member function to override is ``quantize_weight``.
``quantize_weight`` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
"""
from nni.compression.pytorch import Quantizer
class YourQuantizer(Quantizer):
def __init__(self, model, config_list):
"""
We suggest using the NNI-defined spec for the config (an illustrative config_list and a naive quantization sketch are given after this class).
"""
super().__init__(model, config_list)
def quantize_weight(self, weight, config, **kwargs):
"""
Quantizers should override this method to quantize weight tensors.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
weight : Tensor
weight that needs to be quantized
config : dict
the configuration for weight quantization
"""
# Put your code to generate `new_weight` here
new_weight = ...
return new_weight
def quantize_output(self, output, config, **kwargs):
"""
Quantizers should override this method to quantize the output.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
output : Tensor
output that needs to be quantized
config : dict
the configuration for output quantization
"""
# Put your code to generate `new_output` here
new_output = ...
return new_output
def quantize_input(self, *inputs, config, **kwargs):
"""
Quantizers should override this method to quantize the inputs.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
inputs : Tensor
inputs that need to be quantized
config : dict
the configuration for inputs quantization
"""
# Put your code to generate `new_input` here
new_input = ...
return new_input
def update_epoch(self, epoch_num):
pass
def step(self):
"""
Can do some processing based on the model or the weights bound
in the function ``bind_model``.
"""
pass
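
# A hedged sketch of what could replace the ``...`` placeholder in ``quantize_weight``:
# naive symmetric 8-bit "fake quantization" (illustrative only, not NNI's built-in logic).
import torch

def symmetric_quantize_sketch(weight, bits=8):
    # largest representable positive integer of the signed range
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax
    if scale == 0:
        # an all-zero weight tensor needs no quantization
        return weight
    # round onto the integer grid, clamp to the representable range, and map back to float
    return torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale

# An illustrative config_list following NNI's quantization spec; the op types and
# bit widths below are assumptions for the example, not required values.
example_config_list = [{
    'quant_types': ['weight', 'output'],
    'quant_bits': {'weight': 8, 'output': 8},
    'op_types': ['Conv2d', 'Linear']
}]
# Typical usage would then be ``quantizer = YourQuantizer(model, example_config_list)``
# followed by ``quantizer.compress()``.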
# %%
# Customize backward function
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Sometimes it's necessary for a quantization operation to have a customized backward function,
# such as `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\ ,
# users can customize a backward function as follows:
from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
class ClipGrad(QuantGrad):
@staticmethod
def quant_backward(tensor, grad_output, quant_type):
"""
This method should be overridden by subclasses to provide a customized backward function;
the default implementation is the Straight-Through Estimator.
Parameters
----------
tensor : Tensor
input of quantization operation
grad_output : Tensor
gradient of the output of quantization operation
quant_type : QuantType
the type of quantization; it can be `QuantType.INPUT`, `QuantType.WEIGHT`, or `QuantType.OUTPUT`,
and you can define different behavior for different types.
Returns
-------
tensor
gradient of the input of quantization operation
"""
# for output quantization (``QuantType.OUTPUT``), set the gradient to zero if the absolute value of the tensor is larger than 1
if quant_type == QuantType.OUTPUT:
grad_output[tensor.abs() > 1] = 0
return grad_output
class _YourQuantizer(Quantizer):
def __init__(self, model, config_list):
super().__init__(model, config_list)
# set your customized backward function to overwrite default backward function
self.quant_grad = ClipGrad
# %%
# If you do not customize ``QuantGrad``, the default backward function is the Straight-Through Estimator (a minimal sketch of that behavior follows).
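
# A hedged sketch of what that default amounts to: the Straight-Through Estimator treats
# the quantization op as the identity in the backward pass, so the incoming gradient is
# returned unchanged (this mirrors the ``ClipGrad`` skeleton above, without the clipping).
class StraightThroughSketch(QuantGrad):
    @staticmethod
    def quant_backward(tensor, grad_output, quant_type):
        # identity backward: pass the gradient straight through the quantization op
        return grad_output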