Unverified Commit 88ffe908 authored by J-shang, committed by GitHub

[Doc] Remove legacy doc (#4623)

parent 006e1a01
#################
Model Compression
#################
Deep neural networks (DNNs) have achieved great success in many tasks. However, typical neural networks are both
computationally expensive and energy intensive, which makes them difficult to deploy on devices with low computation
resources or with strict latency requirements. Therefore, a natural thought is to perform model compression to
reduce model size and accelerate model training/inference without losing performance significantly. Model compression
techniques can be divided into two categories: pruning and quantization. Pruning methods explore the redundancy
in the model weights and try to remove/prune the redundant and uncritical weights. Quantization refers to compressing
models by reducing the number of bits required to represent weights or activations.
NNI provides an easy-to-use toolkit to help users design and use model pruning and quantization algorithms.
It supports TensorFlow and PyTorch with a unified interface.
For users to compress their models, they only need to add several lines to their code.
Several popular model compression algorithms are built into NNI.
Users can further use NNI's auto-tuning power to find the best compressed model,
which is detailed in Auto Model Compression.
On the other hand, users can easily customize their new compression algorithms using NNI's interface.
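For example, a minimal pruning sketch (``model`` is assumed to be your pre-trained ``torch.nn.Module``; the same pattern appears in the framework overview later on this page):
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner

   # Describe what to prune and how sparse to make it.
   config_list = [{
       'sparsity': 0.7,
       'op_types': ['Conv2d', 'Linear'],
   }]

   pruner = LevelPruner(model, config_list)  # model: your pre-trained module
   model = pruner.compress()
   # The returned model simulates pruning with masks; fine-tune it and the
   # masks are applied automatically during training.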
For details, please refer to the following tutorials:
.. toctree::
:maxdepth: 2
Overview <compression/overview>
Pruning <compression/pruning>
Quantization <compression/quantization>
Advanced Usage <compression/advanced_usage>
Reference <compression/reference>
...@@ -6,6 +6,3 @@ Advanced Usage
Customize Scheduled Pruning Process <pruning_scheduler>
Utilities <compression_utils>
Model Compression with NNI
==========================
.. toctree::
:hidden:
:maxdepth: 2
Pruning <pruning>
Quantization <quantization>
Config Specification <compression_config_list>
Advanced Usage <advanced_usage>
.. attention::
NNI's model pruning framework has been upgraded to a more powerful version (named pruning v2 before NNI v2.6).
The old version (`pruning before NNI v2.6 <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) is out of maintenance. If for some reason you have to use the old pruning,
v2.6 is the last NNI version to support it.
.. Using rubric to prevent the section heading from being included in the toc
.. rubric:: Overview
Deep neural networks (DNNs) have achieved great success in many tasks like computer vision, natural language processing, and speech processing.
However, typical neural networks are both computationally expensive and energy-intensive,
which makes them difficult to deploy on devices with low computation resources or with strict latency requirements.
Therefore, a natural thought is to perform model compression to reduce model size and accelerate model training/inference without losing performance significantly.
Model compression techniques can be divided into two categories: pruning and quantization.
The pruning methods explore the redundancy in the model weights and try to remove/prune the redundant and uncritical weights.
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations.
We further elaborate on the two methods, pruning and quantization, in the following chapters. Besides, the figure below visualizes the difference between these two methods.
.. image:: ../../img/prune_quant.jpg
:target: ../../img/prune_quant.jpg
:scale: 40%
:alt:
NNI provides an easy-to-use toolkit to help users design and use model pruning and quantization algorithms.
For users to compress their models, they only need to add several lines to their code.
Several popular model compression algorithms are built into NNI.
...@@ -33,8 +49,7 @@ There are several core features supported by NNI model compression:
* Concise interface for users to customize their own compression algorithms.
.. rubric:: Compression Pipeline
.. image:: ../../img/compression_flow.jpg
:target: ../../img/compression_flow.jpg
...@@ -49,13 +64,7 @@ If users want to apply both, a sequential mode is recommended as common practice
The interface and APIs are unified for both PyTorch and TensorFlow. Currently only the PyTorch version is supported; the TensorFlow version will be supported in the future.
.. rubric:: Supported Pruning Algorithms
Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and mitigate the over-fitting issue.
...@@ -99,8 +108,7 @@ Pruning algorithms compress the original network by removing redundant weights o
- Movement Pruning: Adaptive Sparsity by Fine-Tuning `Reference Paper <https://arxiv.org/abs/2005.07683>`__
.. rubric:: Supported Quantization Algorithms
Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time.
...@@ -124,21 +132,19 @@ Quantization algorithms compress the original network by reducing the number of
- Post-training quantization. Collect quantization information during calibration with observers.
.. rubric:: Model Speedup
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
For example, pruning algorithms use masks, and quantization algorithms still store quantized values in float32.
Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model.
The following figure shows how NNI prunes and speeds up your models.
.. image:: ../../img/pipeline_compress.jpg
:target: ../../img/pipeline_compress.jpg
:scale: 40%
:alt:
The detailed tutorial of Speed Up Model with Mask can be found :doc:`here <../tutorials/pruning_speed_up>`.
The detailed tutorial of Speed Up Model with Calibration Config can be found :doc:`here <../tutorials/quantization_speed_up>`.
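As a rough sketch of the pruning speedup step (``model`` and ``masks`` are assumed to come from a pruner; see the tutorials above for the exact workflow):
.. code-block:: python

   import torch
   from nni.compression.pytorch import ModelSpeedup

   # masks: the masks produced by a pruner via pruner.compress()
   dummy_input = torch.rand(8, 3, 224, 224)
   ModelSpeedup(model, dummy_input, masks).speedup_model()
   # The masked-out channels are now physically removed, so the model gets
   # genuinely smaller and faster instead of merely simulating sparsity.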
.. 1d14b9d13cdd660f8e9dcb2abed0b185
Model Compression
=================
.. toctree::
:hidden:
:maxdepth: 2
Pruning <pruning>
Quantization <quantization>
Config Specification <compression_config_list>
Advanced Usage <advanced_usage>
Deep neural networks (DNNs) have achieved great success in many fields. However, typical neural networks are
computationally and energy intensive, and are difficult to deploy on devices with scarce computational resources
...@@ -19,14 +27,3 @@ NNI also provides some popular built-in model compression algorithms.
Users can further leverage NNI's auto-tuning capability to find the best compressed model;
this capability is described in detail in the auto model compression section.
On the other hand, users can customize new compression algorithms through NNI's interface.
Auto Compression with NNI Experiment
====================================
If you want to compress your model but don't know which compression algorithm to choose, what sparsity is suitable for your model, or just want to try more possibilities, auto compression may help you.
Users can choose different compression algorithms and define the algorithms' search space; auto compression will then launch an NNI experiment and automatically try different compression algorithms with varying sparsity.
Of course, in addition to the sparsity rate, users can also introduce other related parameters into the search space.
If you don't know what a search space is or how to write one, `this <./Tutorial/SearchSpaceSpec.rst>`__ is for your reference.
Launching auto compression is similar to launching an NNI experiment from Python.
The main differences are as follows:
* Use a generator to help generate the search space object.
* The model to be compressed needs to be provided, and it should already be pre-trained.
* No need to set ``trial_command``; instead, the user-provided ``AutoCompressionModule`` is passed to ``AutoCompressionExperiment`` as input.
.. note::
Auto compression only supports TPE Tuner, Random Search Tuner, Anneal Tuner, and Evolution Tuner right now.
Generate search space
---------------------
Due to the extensive use of nested search spaces, we recommend using a generator to configure the search space.
The following is an example. Use ``add_config()`` to add a sub-config, then ``dumps()`` the search space dict.
.. code-block:: python
from nni.algorithms.compression.pytorch.auto_compress import AutoCompressionSearchSpaceGenerator
generator = AutoCompressionSearchSpaceGenerator()
generator.add_config('level', [
{
"sparsity": {
"_type": "uniform",
"_value": [0.01, 0.99]
},
'op_types': ['default']
}
])
generator.add_config('qat', [
{
'quant_types': ['weight', 'output'],
'quant_bits': {
'weight': 8,
'output': 8
},
'op_types': ['Conv2d', 'Linear']
}])
search_space = generator.dumps()
Now we support the following pruners and quantizers:
.. code-block:: python
PRUNER_DICT = {
'level': LevelPruner,
'slim': SlimPruner,
'l1': L1FilterPruner,
'l2': L2FilterPruner,
'fpgm': FPGMPruner,
'taylorfo': TaylorFOWeightFilterPruner,
'apoz': ActivationAPoZRankFilterPruner,
'mean_activation': ActivationMeanRankFilterPruner
}
QUANTIZER_DICT = {
'naive': NaiveQuantizer,
'qat': QAT_Quantizer,
'dorefa': DoReFaQuantizer,
'bnn': BNNQuantizer
}
Provide user model for compression
----------------------------------
Users need to inherit ``AbstractAutoCompressionModule`` and override its abstract class functions.
.. code-block:: python
from nni.algorithms.compression.pytorch.auto_compress import AbstractAutoCompressionModule
class AutoCompressionModule(AbstractAutoCompressionModule):
@classmethod
def model(cls) -> nn.Module:
...
return _model
@classmethod
def evaluator(cls) -> Callable[[nn.Module], float]:
...
return _evaluator
Users need to implement at least ``model()`` and ``evaluator()``.
If you use an iterative pruner, you additionally need to implement ``optimizer_factory()``, ``criterion()`` and ``sparsifying_trainer()``.
If you want to finetune the model after compression, you need to implement ``optimizer_factory()``, ``criterion()``, ``post_compress_finetuning_trainer()`` and ``post_compress_finetuning_epochs()``.
``optimizer_factory()`` should return a factory function; its input is an iterable of parameters, i.e. your ``model.parameters()``, and its output is an optimizer instance.
The two kinds of trainers should be functions with the signature ``trainer(model, optimizer, criterion, current_epoch)``.
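A rough sketch of these additional functions (the training-loop body and ``_train_loader`` are placeholder assumptions; the exact signatures are defined by the abstract interface linked below):
.. code-block:: python

   from typing import Callable, Iterable

   import torch
   import torch.nn as nn

   class AutoCompressionModule(AbstractAutoCompressionModule):
       ...

       @classmethod
       def optimizer_factory(cls) -> Callable[[Iterable], torch.optim.Optimizer]:
           # The factory receives model.parameters() and returns an optimizer.
           def _factory(params):
               return torch.optim.SGD(params, lr=0.01, momentum=0.9)
           return _factory

       @classmethod
       def criterion(cls) -> Callable:
           return torch.nn.functional.nll_loss

       @classmethod
       def sparsifying_trainer(cls) -> Callable[[nn.Module, torch.optim.Optimizer, Callable, int], None]:
           # A trainer is a plain training-loop function with the fixed
           # signature (model, optimizer, criterion, current_epoch).
           def _trainer(model, optimizer, criterion, current_epoch):
               model.train()
               for data, target in _train_loader:  # _train_loader: your own DataLoader (assumed)
                   optimizer.zero_grad()
                   loss = criterion(model(data), target)
                   loss.backward()
                   optimizer.step()
           return _trainer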
The full abstract interface refers to :githublink:`interface.py <nni/algorithms/compression/pytorch/auto_compress/interface.py>`.
An example of ``AutoCompressionModule`` implementation refers to :githublink:`auto_compress_module.py <examples/model_compress/auto_compress/torch/auto_compress_module.py>`.
Launch NNI experiment
---------------------
This is similar to launching an NNI experiment from Python; the differences are that there is no need to set ``trial_command``, and the user-provided ``AutoCompressionModule`` is passed to ``AutoCompressionExperiment`` as input.
.. code-block:: python
from pathlib import Path
from nni.algorithms.compression.pytorch.auto_compress import AutoCompressionExperiment
from auto_compress_module import AutoCompressionModule
experiment = AutoCompressionExperiment(AutoCompressionModule, 'local')
experiment.config.experiment_name = 'auto compression torch example'
experiment.config.trial_concurrency = 1
experiment.config.max_trial_number = 10
experiment.config.search_space = search_space
experiment.config.trial_code_directory = Path(__file__).parent
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
experiment.config.training_service.use_active_gpu = True
experiment.run(8088)
Customize New Compression Algorithm
===================================
.. contents::
In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface that covers both pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm, and then how to customize a new quantization algorithm.
**Important Note** To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI; see :doc:`Framework overview of model compression <legacy_framework>`.
Customize a new pruning algorithm
---------------------------------
Implementing a new pruning algorithm requires implementing a ``weight masker`` class, which should be a subclass of ``WeightMasker``, and a ``pruner`` class, which should be a subclass of ``Pruner``.
An implementation of ``weight masker`` may look like this:
.. code-block:: python
class MyMasker(WeightMasker):
    def __init__(self, model, pruner):
        super().__init__(model, pruner)
        # You can do some initialization here, such as collecting some statistics,
        # if it is necessary for your algorithm to calculate the masks.

    def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
        # Calculate the mask based on the wrapped module's weight and the target
        # sparsity. As an illustrative example (similar to level pruning), keep
        # the weights with the largest magnitude:
        weight = wrapper.module.weight.data
        num_prune = int(weight.numel() * sparsity)
        if num_prune == 0:
            return {'weight_mask': weight.new_ones(weight.shape)}
        threshold = weight.abs().view(-1).kthvalue(num_prune).values
        mask = (weight.abs() > threshold).type_as(weight)
        return {'weight_mask': mask}
You can refer to NNI's built-in :githublink:`weight masker <nni/algorithms/compression/pytorch/pruning/structured_pruning_masker.py>` implementations when implementing your own weight masker.
A basic ``pruner`` looks like this:
.. code-block:: python
class MyPruner(Pruner):
def __init__(self, model, config_list, optimizer):
super().__init__(model, config_list, optimizer)
self.set_wrappers_attribute("if_calculated", False)
# construct a weight masker instance
self.masker = MyMasker(model, self)
def calc_mask(self, wrapper, wrapper_idx=None):
sparsity = wrapper.config['sparsity']
if wrapper.if_calculated:
# Already pruned, do not prune again as a one-shot pruner
return None
else:
# call your masker to actually calculate the mask for this layer
masks = self.masker.calc_mask(sparsity=sparsity, wrapper=wrapper, wrapper_idx=wrapper_idx)
wrapper.if_calculated = True
return masks
Refer to NNI's built-in :githublink:`pruner <nni/algorithms/compression/pytorch/pruning/one_shot_pruner.py>` implementations when implementing your own pruner class.
----
Customize a new quantization algorithm
--------------------------------------
To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``. Then, override the member functions with the logic of your algorithm. The member function to override is ``quantize_weight``. ``quantize_weight`` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
.. code-block:: python
from nni.compression.pytorch import Quantizer
class YourQuantizer(Quantizer):
def __init__(self, model, config_list):
"""
We suggest using the NNI-defined spec for config.
"""
super().__init__(model, config_list)
def quantize_weight(self, weight, config, **kwargs):
"""
Subclasses should overload this method to quantize weight tensors.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
weight : Tensor
weight that needs to be quantized
config : dict
the configuration for weight quantization
"""
# Put your code to generate `new_weight` here
return new_weight
def quantize_output(self, output, config, **kwargs):
"""
Subclasses should overload this method to quantize output.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
output : Tensor
output that needs to be quantized
config : dict
the configuration for output quantization
"""
# Put your code to generate `new_output` here
return new_output
def quantize_input(self, *inputs, config, **kwargs):
"""
Subclasses should overload this method to quantize input.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
inputs : Tensor
inputs that need to be quantized
config : dict
the configuration for input quantization
"""
# Put your code to generate `new_input` here
return new_input
def update_epoch(self, epoch_num):
pass
def step(self):
"""
Can do some processing based on the model or the weights bound
in ``bind_model``
"""
pass
Customize backward function
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes it's necessary for a quantization operation to have a customized backward function, such as the `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__; users can customize a backward function as follows:
.. code-block:: python
import torch
from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
class ClipGrad(QuantGrad):
@staticmethod
def quant_backward(tensor, grad_output, quant_type):
"""
This method should be overridden by subclasses to provide a customized backward function;
the default implementation is the Straight-Through Estimator.
Parameters
----------
tensor : Tensor
input of quantization operation
grad_output : Tensor
gradient of the output of quantization operation
quant_type : QuantType
the type of quantization, it can be `QuantType.INPUT`, `QuantType.WEIGHT`, `QuantType.OUTPUT`,
you can define different behavior for different types.
Returns
-------
tensor
gradient of the input of quantization operation
"""
# for quant_output function, set grad to zero if the absolute value of tensor is larger than 1
if quant_type == QuantType.OUTPUT:
grad_output[torch.abs(tensor) > 1] = 0
return grad_output
class YourQuantizer(Quantizer):
def __init__(self, model, config_list):
super().__init__(model, config_list)
# set your customized backward function to overwrite default backward function
self.quant_grad = ClipGrad
If you do not customize ``QuantGrad``, the default backward function is the Straight-Through Estimator.
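In other words, the default Straight-Through Estimator simply passes the gradient through the non-differentiable quantization step unchanged:
.. math::

   y = \operatorname{quantize}(x), \qquad
   \frac{\partial L}{\partial x} \approx \frac{\partial L}{\partial y}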
*Coming Soon* ...
Framework overview of model compression
=======================================
.. contents::
The picture below shows an overview of the components of the model compression framework.
.. image:: ../../img/compressor_framework.jpg
:target: ../../img/compressor_framework.jpg
:alt:
There are three major components/classes in the NNI model compression framework: ``Compressor``, ``Pruner`` and ``Quantizer``. Let's look at them in detail one by one:
Compressor
----------
``Compressor`` is the base class for pruners and quantizers. It provides a unified interface for end users, so that pruners and quantizers can be used in the same way. For example, to use a pruner:
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
# load a pretrained model or train a model before using a pruner
configure_list = [{
'sparsity': 0.7,
'op_types': ['Conv2d', 'Linear'],
}]
pruner = LevelPruner(model, configure_list)
model = pruner.compress()
# the model is ready for pruning; now start fine-tuning the model,
# and it will be pruned during training automatically
To use a quantizer:
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
configure_list = [{
'quant_types': ['weight'],
'quant_bits': {
'weight': 8,
},
'op_types':['Conv2d', 'Linear']
}]
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
quantizer = DoReFaQuantizer(model, configure_list, optimizer)
quantizer.compress()
View :githublink:`example code <examples/model_compress>` for more information.
The ``Compressor`` class provides some utility methods for subclasses and users:
Set wrapper attribute
^^^^^^^^^^^^^^^^^^^^^
Sometimes ``calc_mask`` must save some state data; users can use the ``set_wrappers_attribute`` API to register attributes, just like how buffers are registered in PyTorch modules. These buffers are registered to the ``module wrapper``, and users can access them through the ``module wrapper``.
In the above example, we use ``set_wrappers_attribute`` to set a buffer ``if_calculated``, which is used as a flag indicating whether the mask of a layer has already been calculated.
Collect data during forward
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes users want to collect some data during the modules' forward pass, for example, the mean value of the activation. This can be done by adding a customized collector to the module.
.. code-block:: python
class MyMasker(WeightMasker):
def __init__(self, model, pruner):
super().__init__(model, pruner)
# Set attribute `collected_activation` for all wrappers to store
# activations for each layer
self.pruner.set_wrappers_attribute("collected_activation", [])
self.activation = torch.nn.functional.relu
def collector(wrapper, input_, output):
# The collected activation can be accessed via each wrapper's collected_activation
# attribute
wrapper.collected_activation.append(self.activation(output.detach().cpu()))
self.pruner.hook_id = self.pruner.add_activation_collector(collector)
The collector function will be called each time the forward method runs.
Users can also remove this collector like this:
.. code-block:: python
# Save the collector identifier
collector_id = self.pruner.add_activation_collector(collector)
# When the collector is not used any more, it can be removed using
# the saved collector identifier
self.pruner.remove_activation_collector(collector_id)
----
Pruner
------
A pruner receives ``model`` and ``config_list`` as arguments.
Some pruners, like ``TaylorFOWeightFilterPruner``, prune the model per the ``config_list`` during the training loop by adding a hook on ``optimizer.step()``.
The ``Pruner`` class is a subclass of ``Compressor``, so it contains everything in the ``Compressor`` class plus some additional components used only for pruning:
Weight masker
^^^^^^^^^^^^^
A ``weight masker`` is the implementation of a pruning algorithm; it can prune a specified layer, wrapped by the ``module wrapper``, with a specified sparsity.
Pruning module wrapper
^^^^^^^^^^^^^^^^^^^^^^
A ``pruning module wrapper`` is a module containing:
#. the original module
#. some buffers used by ``calc_mask``
#. a new forward method that applies masks before running the original forward method.
The reasons to use a ``module wrapper`` (see the sketch after this list):
#. some buffers are needed by ``calc_mask`` to calculate masks, and these buffers should be registered in the ``module wrapper`` so that the original modules are not contaminated.
#. a new ``forward`` method is needed to apply masks to the weight before calling the real ``forward`` method.
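A simplified sketch of such a wrapper (illustrative only; the class name and details differ from NNI's actual implementation):
.. code-block:: python

   import torch
   import torch.nn as nn

   class SimplePrunerModuleWrapper(nn.Module):
       def __init__(self, module):
           super().__init__()
           self.module = module
           # Buffers live on the wrapper, so the original module is not contaminated.
           self.register_buffer('weight_mask', torch.ones_like(module.weight))

       def forward(self, *inputs):
           # Apply the mask to the weight before running the original forward.
           self.module.weight.data = self.module.weight.data.mul(self.weight_mask)
           return self.module(*inputs)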
Pruning hook
^^^^^^^^^^^^
A pruning hook is installed on a pruner when the pruner is constructed; it is used to call the pruner's ``calc_mask`` method when ``optimizer.step()`` is invoked.
----
Quantizer
---------
The ``Quantizer`` class is also a subclass of ``Compressor``; it is used to compress models by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time. It contains:
Quantization module wrapper
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Each module/layer of the model to be quantized is wrapped by a quantization module wrapper, which provides a new ``forward`` method to quantize the original module's weight, input and output.
Quantization hook
^^^^^^^^^^^^^^^^^
A quantization hook is installed on a quantizer when it is constructed; it is called at ``optimizer.step()``.
Quantization methods
^^^^^^^^^^^^^^^^^^^^
The ``Quantizer`` class provides the following methods for subclasses to implement quantization algorithms:
.. code-block:: python
class Quantizer(Compressor):
"""
Base quantizer for pytorch quantizer
"""
def quantize_weight(self, weight, wrapper, **kwargs):
"""
Subclasses should overload this method to quantize weight.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
weight : Tensor
weight that needs to be quantized
wrapper : QuantizerModuleWrapper
the wrapper of the original module
"""
raise NotImplementedError('Quantizer must overload quantize_weight()')
def quantize_output(self, output, wrapper, **kwargs):
"""
Subclasses should overload this method to quantize output.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
output : Tensor
output that needs to be quantized
wrapper : QuantizerModuleWrapper
the wrapper of the original module
"""
raise NotImplementedError('Quantizer must overload quantize_output()')
def quantize_input(self, *inputs, wrapper, **kwargs):
"""
Subclasses should overload this method to quantize input.
This method is effectively hooked to :meth:`forward` of the model.
Parameters
----------
inputs : Tensor
inputs that need to be quantized
wrapper : QuantizerModuleWrapper
the wrapper of the original module
"""
raise NotImplementedError('Quantizer must overload quantize_input()')
----
Multi-GPU support
-----------------
With multi-GPU training, buffers and parameters are copied to each GPU every time the ``forward`` method runs. If buffers and parameters are updated in the ``forward`` method, an in-place update is needed to ensure the update is effective.
Since ``calc_mask`` is called in the ``optimizer.step`` method, which happens after the ``forward`` method and only on one GPU, it supports multi-GPU naturally.
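An illustrative sketch of the in-place requirement (the wrapper and its ``steps`` buffer are hypothetical):
.. code-block:: python

   import torch
   import torch.nn as nn

   class CountingWrapper(nn.Module):
       def __init__(self, module):
           super().__init__()
           self.module = module
           self.register_buffer('steps', torch.zeros(1))

       def forward(self, *inputs):
           # In-place update (+=) writes through to the underlying buffer;
           # `self.steps = self.steps + 1` would rebind the attribute on the
           # per-GPU replica and the update would be lost.
           self.steps += 1
           return self.module(*inputs)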
Model Pruning with NNI
======================
Pruning is a common technique to compress neural network models.
The pruning methods explore the redundancy in the model weights (parameters) and try to remove/prune the redundant and uncritical weights.
The redundant elements are pruned from the model, their values are zeroed, and we make sure they don't take part in the back-propagation process.
The following concepts can help you understand pruning in NNI.
.. Using rubric to prevent the section heading from being included in the toc
.. rubric:: Pruning Target
Pruning target means where we apply the sparsity.
Most pruning methods prune the weights to reduce the model size and accelerate the inference latency.
Other pruning methods also apply sparsity to the inputs, outputs or intermediate states to accelerate the inference latency.
NNI supports pruning module weights right now, and will support other pruning targets in the future.
.. rubric:: Basic Pruner
A basic pruner generates the masks for each pruning target (weights) for a determined sparsity ratio.
It usually takes a model and a config as input arguments, then generates masks for the model.
.. rubric:: Scheduled Pruner
A scheduled pruner decides how to allocate the sparsity ratio to each pruning target; it also handles the pruning speedup and finetuning logic.
From the implementation logic, a scheduled pruner is a combination of a pruning scheduler, a basic pruner and a task generator.
The task generator only cares about the pruning effect that should be achieved in each round, and uses a config list to express how to prune.
The basic pruner is reset with the model and config list given by the task generator, then generates the masks.
For a clearer structure vision, please refer to the figure below.
.. image:: ../../img/pruning_process.png
:target: ../../img/pruning_process.png
:scale: 80%
:align: center
:alt:
For more information about the scheduled pruning process, please refer to :doc:`Pruning Scheduler <pruning_scheduler>`.
.. rubric:: Granularity
Fine-grained pruning or unstructured pruning refers to pruning each individual weight separately.
Coarse-grained pruning or structured pruning prunes an entire group of weights, such as a convolutional filter.
:ref:`level-pruner` is the only fine-grained pruner in NNI; all other pruners prune the output channels of weights.
.. _dependency-awareode-for-output-channel-pruning:
.. rubric:: Dependency-aware Mode for Output Channel Pruning
Currently, we support the ``dependency aware`` mode in several pruners: :ref:`l1-norm-pruner`, :ref:`l2-norm-pruner`, :ref:`fpgm-pruner`,
:ref:`activation-apoz-rank-pruner`, :ref:`activation-mean-rank-pruner`, :ref:`taylor-fo-weight-pruner`.
In these pruning algorithms, the pruner prunes each layer separately. While pruning a layer,
the algorithm quantifies the importance of each filter based on some specific rule (such as the L1 norm), and prunes the less important output channels.
We use pruning convolutional layers as an example to explain the ``dependency aware`` mode.
As the :doc:`dependency analysis utils <./compression_utils>` show, if the output channels of two convolutional layers (conv1, conv2) are added together,
then these two convolutional layers have a channel dependency with each other (for more details please see :doc:`Compression Utils <./compression_utils>`).
Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
:target: ../../img/mask_conflict.jpg
:scale: 80%
:align: center
:alt:
Suppose we prune the first 50% of output channels (filters) for conv1 and the last 50% of output channels for conv2.
Although both layers have pruned 50% of the filters, the speedup module still needs to add zeros to align the output channels.
In this case, we cannot harvest the speed benefit from the model pruning.
To better gain the speed benefit of model pruning, we added a dependency-aware mode for the pruners that can prune output channels.
In the dependency-aware mode, the pruner prunes the model based not only on the metric of each output channel, but also on the topology of the whole network architecture.
In the dependency-aware mode (``dependency_aware`` is set to ``True``), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
:target: ../../img/dependency-aware.jpg
:scale: 80%
:align: center
:alt:
Take the dependency-aware mode of :ref:`l1-norm-pruner` as an example.
Specifically, for each channel the pruner calculates the sum of the L1 norms across all the layers in the dependency set.
Obviously, the number of channels that can actually be pruned in this dependency set is determined by the minimum sparsity of the layers in the set (denoted by ``min_sparsity``).
According to the L1 norm sum of each channel, the pruner prunes the same ``min_sparsity`` channels for all the layers.
Next, the pruner additionally prunes ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on its own per-channel L1 norm.
For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively.
In this case, the ``dependency-aware pruner`` will
* First, prune the same 20% of channels for `conv1` and `conv2` according to the L1 norm sum of `conv1` and `conv2`.
* Second, additionally prune 10% of the channels for `conv1` according to the per-channel L1 norm of `conv1`.
.. note:: In addition, for the convolutional layers that have more than one filter group,
the ``dependency-aware pruner`` will also try to prune the same number of channels for each filter group.
Overall, this pruner prunes the model according to the L1 norm of each filter and tries to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
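A sketch of enabling the mode (assuming the v2 pruner API; ``dummy_input`` is required so the pruner can trace the topology and find channel dependencies):
.. code-block:: python

   import torch
   from nni.compression.pytorch.pruning import L1NormPruner

   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
   pruner = L1NormPruner(model, config_list, mode='dependency_aware',
                         dummy_input=torch.rand(8, 3, 224, 224))
   _, masks = pruner.compress()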
.. toctree::
:hidden:
:maxdepth: 2
Quickstart <../tutorials/pruning_quick_start_mnist>
Pruner Reference <pruner>
Speed Up <../tutorials/pruning_speed_up>
Pruning Concepts
================
Pruning is a common technique to compress neural network models.
The pruning methods explore the redundancy in the model weights (parameters) and try to remove/prune the redundant and uncritical weights.
The redundant elements are pruned from the model, their values are zeroed, and we make sure they don't take part in the back-propagation process.
In NNI, a pruning method is divided into multiple dimensions.
Pruning Target
--------------
Pruning target means where we apply the sparsity.
Most pruning methods prune the weights to reduce the model size and accelerate the inference latency.
Other pruning methods also apply sparsity to the inputs and outputs to accelerate the inference latency.
NNI supports pruning module weights right now, and will support pruning inputs & outputs in the future.
Basic Pruners & Scheduled Pruners
---------------------------------
Basic pruners generate the masks for each pruning target (weights) for a determined sparsity ratio.
Scheduled pruners decide how to allocate the sparsity ratio to each pruning target; they always work with a basic pruner to generate masks.
Granularity
-----------
Fine-grained pruning or unstructured pruning refers to pruning each individual weight separately.
Coarse-grained pruning or structured pruning prunes an entire group of weights, such as a convolutional filter.
:ref:`level-pruner` is the only fine-grained pruner in NNI; all other pruners prune the output channels of weights.
.. _dependency-awareode-for-output-channel-pruning:
Dependency-aware Mode for Output Channel Pruning
------------------------------------------------
Currently, we support the ``dependency aware`` mode in several pruners: :ref:`l1-norm-pruner`, :ref:`l2-norm-pruner`, :ref:`fpgm-pruner`,
:ref:`activation-apoz-rank-pruner`, :ref:`activation-mean-rank-pruner`, :ref:`taylor-fo-weight-pruner`.
In these pruning algorithms, the pruner prunes each layer separately. While pruning a layer,
the algorithm quantifies the importance of each filter based on some specific rule (such as the L1 norm), and prunes the less important output channels.
We use pruning convolutional layers as an example to explain the ``dependency aware`` mode.
As the :doc:`dependency analysis utils <./compression_utils>` show, if the output channels of two convolutional layers (conv1, conv2) are added together,
then these two convolutional layers have a channel dependency with each other (for more details please see :doc:`Compression Utils <./compression_utils>`).
Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
:target: ../../img/mask_conflict.jpg
:alt:
Suppose we prune the first 50% of output channels (filters) for conv1 and the last 50% of output channels for conv2.
Although both layers have pruned 50% of the filters, the speedup module still needs to add zeros to align the output channels.
In this case, we cannot harvest the speed benefit from the model pruning.
To better gain the speed benefit of model pruning, we added a dependency-aware mode for the pruners that can prune output channels.
In the dependency-aware mode, the pruner prunes the model based not only on the metric of each output channel, but also on the topology of the whole network architecture.
In the dependency-aware mode (``dependency_aware`` is set to ``True``), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
:target: ../../img/dependency-aware.jpg
:alt:
Take the dependency-aware mode of :ref:`l1-norm-pruner` as an example.
Specifically, for each channel the pruner calculates the sum of the L1 norms across all the layers in the dependency set.
Obviously, the number of channels that can actually be pruned in this dependency set is determined by the minimum sparsity of the layers in the set (denoted by ``min_sparsity``).
According to the L1 norm sum of each channel, the pruner prunes the same ``min_sparsity`` channels for all the layers.
Next, the pruner additionally prunes ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on its own per-channel L1 norm.
For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively.
In this case, the ``dependency-aware pruner`` will
* First, prune the same 20% of channels for `conv1` and `conv2` according to the L1 norm sum of `conv1` and `conv2`.
* Second, additionally prune 10% of the channels for `conv1` according to the per-channel L1 norm of `conv1`.
In addition, for the convolutional layers that have more than one filter group,
the ``dependency-aware pruner`` will also try to prune the same number of channels for each filter group.
Overall, this pruner prunes the model according to the L1 norm of each filter and tries to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
Model Quantization with NNI
===========================
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations,
which can reduce the computations and the inference time. In the context of deep neural networks, the major numerical
format for model weights is 32-bit float, or FP32. Many research works have demonstrated that weights and activations
can be represented using 8-bit integers without significant loss in accuracy. Even lower bit-widths, such as 4/2/1 bits,
are an active field of research.
A quantizer is the implementation of a quantization algorithm in NNI. NNI provides multiple quantizers as listed below. You can also
create your own quantizer using the NNI model compression interface.
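A minimal usage sketch (following the quantizer usage pattern shown elsewhere in these docs; ``model`` and ``optimizer`` are assumed to be your own):
.. code-block:: python

   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

   config_list = [{
       'quant_types': ['weight', 'output'],
       'quant_bits': {'weight': 8, 'output': 8},
       'op_types': ['Conv2d', 'Linear']
   }]
   quantizer = QAT_Quantizer(model, config_list, optimizer)
   quantizer.compress()
   # Training then proceeds as usual; quantization is simulated in the
   # forward pass, i.e. quantization-aware training.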
.. toctree::
:hidden:
:maxdepth: 2
Quickstart <../tutorials/quantization_quick_start_mnist>
Quantizer Reference <quantizer>
Speed Up <../tutorials/quantization_speed_up>
Reference
=========
.. toctree::
:maxdepth: 2
Pruner V2 Reference <../reference/pruner>
Pruner Reference (legacy) <../reference/legacy_pruner>
Quantizer Reference <../reference/quantizer>
Compression Config Specification <../reference/compression_config_list>
...@@ -23,7 +23,7 @@ Neural Network Intelligence
Overview
Auto (Hyper-parameter) Tuning <hyperparameter_tune>
Neural Architecture Search <nas/index>
Model Compression <compression/index>
Feature Engineering <feature_engineering>
Experiment <experiment/overview>
.. 84633d9c4ebf3421e7618c56117045c2
###########################
Neural Network Intelligence
...@@ -16,7 +16,7 @@ Neural Network Intelligence
Tutorials <tutorials>
Auto (Hyper-parameter) Tuning <hyperparameter_tune>
Neural Architecture Search <nas/index>
Model Compression <compression/index>
Feature Engineering <feature_engineering>
NNI Experiment <experiment/overview>
References <reference>
...@@ -17,4 +17,3 @@ References
Supported Framework Library <SupportedFramework_Library>
Launch from Python <Tutorial/HowToLaunchFromPython>
Tensorboard <Tutorial/Tensorboard>
.. e8dca0b3551823aef1648bcef1745028
:orphan: