In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface that covers both pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then how to customize a new quantization algorithm.
**Important Note**: to better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. Refer to :doc:`Framework overview of model compression <legacy_framework>`.
Customize a new pruning algorithm
---------------------------------
...
Customize backward function
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes it's necessary for a quantization operation to have a customized backward function, such as the `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__. Users can customize a backward function as follows:
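For illustration, here is a minimal straight-through estimator written as a plain ``torch.autograd.Function``. This is a generic PyTorch sketch of the idea rather than NNI's built-in quantization interface:

.. code-block:: python

   import torch


   class RoundSTE(torch.autograd.Function):
       """Round in the forward pass; pass the gradient straight through in the backward pass."""

       @staticmethod
       def forward(ctx, x):
           # Non-differentiable quantization step (simple rounding here).
           return torch.round(x)

       @staticmethod
       def backward(ctx, grad_output):
           # Straight-through estimator: treat the forward op as the identity,
           # so the incoming gradient is returned unchanged.
           return grad_output


   x = torch.randn(4, requires_grad=True)
   y = RoundSTE.apply(x).sum()
   y.backward()
   print(x.grad)  # tensor of ones: gradients flow through the rounding op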
Deep neural networks (DNNs) have achieved great success in many tasks.
However, typical neural networks are both computationally expensive and energy-intensive,
and can be difficult to deploy on devices with limited computation resources or strict latency requirements.
Therefore, a natural thought is to perform model compression to reduce the model size and accelerate model training/inference without significantly losing performance.
Model compression techniques can be divided into two categories: pruning and quantization.
Pruning methods explore the redundancy in the model weights and try to remove/prune the redundant and uncritical weights.
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations.
As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for real-time applications. Model compression can be used to address this problem.
NNI provides an easy-to-use model compression toolkit to help users design and use pruning and quantization algorithms, and to compress and speed up their models with state-of-the-art algorithms and strategies.
To compress a model, users only need to add several lines to their code.
Some popular model compression algorithms are built into NNI.
Users can further use NNI's auto-tuning power to find the best compressed model, which is detailed in Auto Model Compression.
On the other hand, users can easily customize new compression algorithms using NNI's interface.
There are several core features supported by NNI model compression:
* Support many popular pruning and quantization algorithms.
* Automate the model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
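As a rough illustration of how few lines are needed, the sketch below applies a built-in pruner to a PyTorch model. It assumes the v1-style compression API (``L1FilterPruner`` under ``nni.algorithms.compression.pytorch.pruning``); please check the installed NNI version for the exact import path and signatures.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed v1-style import path; verify against your NNI version.
   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner

   model = resnet18()

   # Prune 50% of the output channels of every Conv2d layer.
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

   pruner = L1FilterPruner(model, config_list)
   pruner.compress()  # wraps the target layers and applies masks

   # Export the masked weights and the masks for later speedup.
   pruner.export_model(model_path='pruned_resnet18.pth', mask_path='mask_resnet18.pth')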
...

Compression Pipeline
--------------------
The overall compression pipeline in NNI. For compressing a pretrained model, pruning and quantization can be used alone or in combination.
.. note::
   NNI compression algorithms do not truly shrink the model by themselves; the NNI speedup tool is what really compresses the model and reduces latency.
   To obtain a truly compact model, users should conduct :doc:`model speedup <../tutorials/pruning_speed_up>`.
   The interface and APIs are unified for both PyTorch and TensorFlow; currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.
Supported Algorithms
--------------------
...

Pruning Algorithms
^^^^^^^^^^^^^^^^^^
.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Name
     - Brief Introduction of Algorithm
   * - :ref:`level-pruner`
     - Pruning the specified ratio on each weight based on absolute values of weights
   * - :ref:`l1-norm-pruner`
     - Pruning output channels with the smallest L1 norm of weights (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
   * - L2 Norm Pruner
     - Pruning output channels with the smallest L2 norm of weights
   * - Slim Pruner
     - Pruning output channels by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
   * - Activation APoZ Rank Pruner
     - Pruning output channels based on the metric APoZ (average percentage of zeros), which measures the percentage of zeros in the activations of (convolutional) layers. `Reference Paper <https://arxiv.org/abs/1607.03250>`__
   * - Activation Mean Rank Pruner
     - Pruning output channels based on the metric that calculates the smallest mean value of output activations
   * - Taylor FO Weight Pruner
     - Pruning filters based on the first-order Taylor expansion of the weights (Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
   * - AGP Pruner
     - Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
   * - :ref:`lottery-ticket-pruner`
     - The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper <https://arxiv.org/abs/1803.03635>`__
   * - :ref:`simulated-annealing-pruner`
     - Automatic pruning with a guided heuristic search method, the Simulated Annealing algorithm `Reference Paper <https://arxiv.org/abs/1907.03141>`__
   * - AMC Pruner
     - AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/abs/1802.03494>`__
   * - Transformer Head Pruner
     - Pruning attention heads from transformer models either in one shot or iteratively.
   * - :ref:`movement-pruner`
     - Movement Pruning: Adaptive Sparsity by Fine-Tuning `Reference Paper <https://arxiv.org/abs/2005.07683>`__

You can refer to this `benchmark <../CommunitySharings/ModelCompressionComparison.rst>`__ for the performance of these pruners on some benchmark problems.
Quantization Algorithms
^^^^^^^^^^^^^^^^^^^^^^^
...
.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Name
     - Brief Introduction of Algorithm
   * - QAT Quantizer
     - Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
   * - BNN Quantizer
     - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
   * - Observer Quantizer
     - Post training quantization. Collect quantization information during calibration with observers.

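For example, a quantizer is applied in much the same way as a pruner. The sketch below assumes the v1-style ``QAT_Quantizer`` API and a user-provided training loop; check the installed NNI version for exact signatures.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed v1-style import path; verify against your NNI version.
   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

   model = resnet18()
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

   # Quantize the weights and outputs of all Conv2d layers to 8 bits.
   config_list = [{
       'quant_types': ['weight', 'output'],
       'quant_bits': {'weight': 8, 'output': 8},
       'op_types': ['Conv2d'],
   }]

   quantizer = QAT_Quantizer(model, config_list, optimizer)
   quantizer.compress()

   # ... run the usual quantization-aware training loop on `model` here ...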
Model Speedup
-------------
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
For example, pruning algorithms use masks to simulate sparsity, and quantization algorithms still store quantized values in float32.
Given the output masks and quantization bits produced by those algorithms, NNI can truly speed up the model.
The detailed tutorial of Speed Up Model with Mask can be found :doc:`here <../tutorials/pruning_speed_up>`.
The detailed tutorial of Speed Up Model with Calibration Config can be found :doc:`here <../tutorials/quantization_speed_up>`.
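A minimal sketch of the speedup step for a pruned model is shown below. It assumes the masks were exported by a pruner (as in ``pruner.export_model``) and that ``ModelSpeedup`` lives under ``nni.compression.pytorch``; verify both against the installed NNI version.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed import path; verify against your NNI version.
   from nni.compression.pytorch import ModelSpeedup

   model = resnet18()
   dummy_input = torch.rand(1, 3, 224, 224)

   # 'mask_resnet18.pth' is the mask file exported by the pruner.
   ModelSpeedup(model, dummy_input, 'mask_resnet18.pth').speedup_model()

   # The masked channels are now physically removed, so the model is smaller and faster.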
Compression Utilities
---------------------

Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users can check the sensitivity of each layer to pruning, and easily calculate the FLOPs and parameter size of a model. Please refer to `here <./CompressionUtils.rst>`__ for a complete list of compression utilities.
Advanced Usage
--------------
NNI model compression provides a simple interface for users to customize new compression algorithms. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. Users can learn more about our compression framework and customize new compression algorithms (pruning or quantization) based on it. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. Please refer to `here <./advanced.rst>`__ for more details.
Reference and Feedback
----------------------
* To `report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.rst>`__ for this feature on GitHub;
* To `file a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__ for this feature on GitHub;
* To know more about `Feature Engineering with NNI <../FeatureEngineering/Overview.rst>`__;
* To know more about `NAS with NNI <../NAS/Overview.rst>`__;
* To know more about `Hyperparameter Tuning with NNI <../Tuner/BuiltinTuner.rst>`__.
Pruning V2 is a refactoring of the old version and provides more powerful functions.
Compared with the old version, the iterative pruning process is detached from the pruner, and the pruner is only responsible for pruning and generating the masks once.
For a clearer view of the structure, please refer to the figure below.
.. image:: ../../img/pruning_process.png
   :target: ../../img/pruning_process.png
   :alt:
A pruning process is usually driven by a pruning scheduler, which contains a specific pruner and a task generator.
For details, please refer to the following tutorials:

.. note:: Users can also use a pruner directly, as in pruning V1.
In these pruning algorithms, the pruner prunes each layer separately. While pruning a layer,
the algorithm quantifies the importance of each filter based on a specific rule (such as the L1 norm) and prunes the less important output channels.
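For intuition, the kind of per-filter importance metric mentioned above can be computed with plain PyTorch; this is only an illustration of the metric, not NNI's internal implementation:

.. code-block:: python

   import torch

   conv = torch.nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)

   # L1 norm of each filter (output channel): sum of absolute weight values.
   l1_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # shape: [32]

   # Select the 50% of output channels with the smallest L1 norm as pruning candidates.
   num_prune = conv.out_channels // 2
   channels_to_prune = torch.argsort(l1_norms)[:num_prune]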
We use pruning convolutional layers as an example to explain the ``dependency aware`` mode.
As the :doc:`dependency analysis utils <./compression_utils>` show, if the output channels of two convolutional layers (conv1, conv2) are added together,
then these two convolutional layers have a channel dependency with each other (for more details, please see :doc:`Compression Utils <./compression_utils>`).
Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
   :target: ../../img/mask_conflict.jpg
   :alt:
Suppose we prune the first 50% of output channels (filters) of conv1 and the last 50% of output channels of conv2.
Although both layers have 50% of their filters pruned, the speedup module still needs to add zeros to align the output channels.
In this case, we cannot harvest the speed benefit from the model pruning.
To better gain the speed benefit of model pruning, we add a dependency-aware mode for the ``Pruner`` classes that prune output channels.
In the dependency-aware mode, the pruner prunes the model based not only on the metric of each output channel, but also on the topology of the whole network architecture.
In the dependency-aware mode (``dependency_aware`` is set to ``True``), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
   :target: ../../img/dependency-aware.jpg
   :alt:
Take the dependency-aware mode of :ref:`l1-norm-pruner` as an example.
Specifically, for each channel the pruner calculates the sum of the L1 norms (in this example) of that channel across all the layers in the dependency set.
The number of channels that can actually be pruned for the whole dependency set is determined by the minimum sparsity among the layers in this set (denoted by ``min_sparsity``).
According to the summed L1 norm of each channel, the pruner prunes the same ``min_sparsity`` proportion of channels for all the layers.
Next, the pruner additionally prunes ``sparsity - min_sparsity`` of the channels of each convolutional layer based on its own per-channel L1 norms.
For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively.
In this case, the dependency-aware pruner will:

* first prune the same 20% of channels for ``conv1`` and ``conv2`` according to the summed L1 norms of ``conv1`` and ``conv2``;
* then additionally prune 10% of the channels of ``conv1`` according to the per-channel L1 norms of ``conv1``.
In addition, for convolutional layers that have more than one filter group,
the dependency-aware pruner will also try to prune the same number of channels in each filter group.
Overall, this pruner prunes the model according to the L1 norm of each filter while trying to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner therefore yields a better speed gain from model pruning.
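A minimal sketch of enabling the dependency-aware mode is shown below. It assumes the v1-style pruner constructor accepts ``dependency_aware`` and ``dummy_input`` arguments, as the description above suggests; check the installed NNI version for the exact signature.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed v1-style import path; verify against your NNI version.
   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner

   # resnet18 contains residual additions, so several convolution layers
   # share channel dependencies with each other.
   model = resnet18()
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

   # `dummy_input` lets the pruner trace the graph and discover which
   # layers depend on each other before deciding which channels to prune.
   pruner = L1FilterPruner(model, config_list,
                           dependency_aware=True,
                           dummy_input=torch.rand(1, 3, 224, 224))
   pruner.compress()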
This is a one-shot pruner that prunes filters with the smallest geometric median. FPGM chooses the filters with the most replaceable contribution.
For more details, please refer to `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/pdf/1811.00250.pdf>`__.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-awareode-for-output-channel-pruning` for more details.
Usage
^^^^^
...
In addition, we also provide a dependency-aware mode for the L1FilterPruner. For more details about the dependency-aware mode, please refer to :ref:`dependency-awareode-for-output-channel-pruning`.
Usage
^^^^^
...

L2Filter Pruner
---------------
This is a structured pruning algorithm that prunes the filters with the smallest L2 norm of the weights. It is implemented as a one-shot pruner.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-awareode-for-output-channel-pruning` for more details.
ActivationMeanRankFilterPruner is a pruner that prunes the filters with the smallest importance criterion ``mean activation``, calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference <https://arxiv.org/abs/1611.06440>`__. Other pruning criteria mentioned in this paper will be supported in a future release.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-awareode-for-output-channel-pruning` for more details.
Usage
^^^^^
...
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-awareode-for-output-channel-pruning` for more details.
What's more, we provide a global-sort mode for this pruner, which is aligned with the paper's implementation. Set the parameter ``global_sort`` to ``True`` when instantiating ``TaylorFOWeightFilterPruner``, as in the sketch below.
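A hedged sketch of enabling the global-sort mode follows. The training-related arguments that this pruner needs in order to collect activation statistics (``optimizer``, ``trainer``, ``criterion``) are shown as placeholders and are assumptions about the v1-style constructor; check the installed NNI version for the exact signature.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed v1-style import path; verify against your NNI version.
   from nni.algorithms.compression.pytorch.pruning import TaylorFOWeightFilterPruner

   model = resnet18()
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
   criterion = torch.nn.CrossEntropyLoss()
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


   def trainer(model, optimizer, criterion, epoch):
       # Placeholder training function: iterate over your own data loader here
       # so the pruner can observe a few batches and estimate Taylor importance.
       pass


   pruner = TaylorFOWeightFilterPruner(
       model, config_list,
       optimizer=optimizer,
       trainer=trainer,
       criterion=criterion,
       global_sort=True,  # rank filters across all layers jointly, as in the paper
   )
   pruner.compress()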