the source code of NNI for DCU

Commit 1011377c authored Mar 31, 2022 by qianyj
20 changed files
--- a/docs/en_US/CommunitySharings/feature_engineering.rst
+++ b/docs/en_US/CommunitySharings/feature_engineering.rst
+###################
+Feature Engineering
+###################
+
+The following is an article about how NNI helps in auto feature engineering shared by a community contributor. More use cases and solutions will be added in the future.
+
+..  toctree::
+    :maxdepth: 1
+
+    NNI review article from Zhihu: - By Garvin Li <NNI_AutoFeatureEng>
\ No newline at end of file
--- a/docs/en_US/CommunitySharings/model_compression.rst
+++ b/docs/en_US/CommunitySharings/model_compression.rst
+#################
+Model Compression
+#################
+
+The following example shows how to apply knowledge distillation with NNI model compression. More use cases and solutions will be added in the future.
+
+..  toctree::
+    :maxdepth: 1
+
+    Knowledge distillation with NNI model compression <../TrialExample/KDExample>
\ No newline at end of file
--- a/docs/en_US/CommunitySharings/perf_compare.rst
+++ b/docs/en_US/CommunitySharings/perf_compare.rst
+################################################
+Performance Measurement, Comparison and Analysis
+################################################
+
+Performance comparison and analysis can help users choose a suitable algorithm (e.g., tuner, NAS algorithm) for their scenario. The following measurement and comparison data are provided for reference.
+
+..  toctree::
+    :maxdepth: 1
+
+    Neural Architecture Search Comparison <NasComparison>
+    Hyper-parameter Tuning Algorithm Comparison <HpoComparison>
+    Model Compression Algorithm Comparison <ModelCompressionComparison>
\ No newline at end of file
--- a/docs/en_US/Compression/AutoCompression.rst
+++ b/docs/en_US/Compression/AutoCompression.rst
+Auto Compression with NNI Experiment
+====================================
+
+If you want to compress your model, but don't know what compression algorithm to choose, or don't know what sparsity is suitable for your model, or just want to try more possibilities, auto compression may help you.
+Users can choose different compression algorithms and define the algorithms' search space, then auto compression will launch an NNI experiment and try different compression algorithms with varying sparsity automatically. 
+Of course, in addition to the sparsity rate, users can also introduce other related parameters into the search space.
+If you are not familiar with search spaces or how to write one, see `this <./Tutorial/SearchSpaceSpec.rst>`__ for reference.
+Using auto compression is similar to launching an NNI experiment from Python.
+The main differences are as follows:
+
+* Use a generator to build the search space object.
+* Provide the model to be compressed; it should already be pre-trained.
+* Do not set ``trial_command``; instead, pass ``auto_compress_module`` as the ``AutoCompressionExperiment`` input.
+
+.. note::
+    Auto compression only supports TPE Tuner, Random Search Tuner, Anneal Tuner, Evolution Tuner right now.
+
+Generate search space
+---------------------
+
+Because the search space is heavily nested, we recommend using a generator to configure it.
+The following is an example. Use ``add_config()`` to add sub-configs, then ``dumps()`` to produce the search space dict.
+
+.. code-block:: python
+
+    from nni.algorithms.compression.pytorch.auto_compress import AutoCompressionSearchSpaceGenerator
+
+    generator = AutoCompressionSearchSpaceGenerator()
+    generator.add_config('level', [
+        {
+            "sparsity": {
+                "_type": "uniform",
+                "_value": [0.01, 0.99]
+            },
+            'op_types': ['default']
+        }
+    ])
+    generator.add_config('qat', [
+        {
+            'quant_types': ['weight', 'output'],
+            'quant_bits': {
+                'weight': 8,
+                'output': 8
+            },
+            'op_types': ['Conv2d', 'Linear']
+        }
+    ])
+
+    search_space = generator.dumps()
+
+Now we support the following pruners and quantizers:
+
+.. code-block:: python
+
+    PRUNER_DICT = {
+        'level': LevelPruner,
+        'slim': SlimPruner,
+        'l1': L1FilterPruner,
+        'l2': L2FilterPruner,
+        'fpgm': FPGMPruner,
+        'taylorfo': TaylorFOWeightFilterPruner,
+        'apoz': ActivationAPoZRankFilterPruner,
+        'mean_activation': ActivationMeanRankFilterPruner
+    }
+
+    QUANTIZER_DICT = {
+        'naive': NaiveQuantizer,
+        'qat': QAT_Quantizer,
+        'dorefa': DoReFaQuantizer,
+        'bnn': BNNQuantizer
+    }
+
+Provide user model for compression
+----------------------------------
+
+Users need to inherit from ``AbstractAutoCompressionModule`` and override its abstract class methods.
+
+.. code-block:: python
+
+    from typing import Callable
+
+    import torch.nn as nn
+    from nni.algorithms.compression.pytorch.auto_compress import AbstractAutoCompressionModule
+
+    class AutoCompressionModule(AbstractAutoCompressionModule):
+        @classmethod
+        def model(cls) -> nn.Module:
+            ...
+            return _model
+
+        @classmethod
+        def evaluator(cls) -> Callable[[nn.Module], float]:
+            ...
+            return _evaluator
+
+Users need to implement at least ``model()`` and ``evaluator()``.
+If you use an iterative pruner, you additionally need to implement ``optimizer_factory()``, ``criterion()`` and ``sparsifying_trainer()``.
+If you want to fine-tune the model after compression, you need to implement ``optimizer_factory()``, ``criterion()``, ``post_compress_finetuning_trainer()`` and ``post_compress_finetuning_epochs()``.
+``optimizer_factory()`` should return a factory function whose input is an iterable of parameters, i.e. your ``model.parameters()``, and whose output is an optimizer instance.
+Both kinds of ``trainer()`` should return a trainer that takes ``model, optimizer, criterion, current_epoch`` as input.
+For the full abstract interface, see :githublink:`interface.py <nni/algorithms/compression/pytorch/auto_compress/interface.py>`.
+For an example ``AutoCompressionModule`` implementation, see :githublink:`auto_compress_module.py <examples/model_compress/auto_compress/torch/auto_compress_module.py>`.
+
+Launch NNI experiment
+---------------------
+
+Launching is similar to launching an experiment from Python. The differences are that ``trial_command`` does not need to be set and that the user-provided ``AutoCompressionModule`` is passed to ``AutoCompressionExperiment`` as input.
+
+.. code-block:: python
+
+    from pathlib import Path
+    from nni.algorithms.compression.pytorch.auto_compress import AutoCompressionExperiment
+
+    from auto_compress_module import AutoCompressionModule
+
+    experiment = AutoCompressionExperiment(AutoCompressionModule, 'local')
+    experiment.config.experiment_name = 'auto compression torch example'
+    experiment.config.trial_concurrency = 1
+    experiment.config.max_trial_number = 10
+    experiment.config.search_space = search_space
+    experiment.config.trial_code_directory = Path(__file__).parent
+    experiment.config.tuner.name = 'TPE'
+    experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
+    experiment.config.training_service.use_active_gpu = True
+
+    experiment.run(8088)
--- a/docs/en_US/Compression/CompressionReference.rst
+++ b/docs/en_US/Compression/CompressionReference.rst
+Model Compression API Reference
+===============================
+
+.. contents::
+
+Compressors
+-----------
+
+Compressor
+^^^^^^^^^^
+
+..  autoclass:: nni.compression.pytorch.compressor.Compressor
+    :members:
+
+..  autoclass:: nni.compression.pytorch.compressor.Pruner
+    :members:
+
+..  autoclass:: nni.compression.pytorch.compressor.Quantizer
+    :members:
+
+
+Module Wrapper
+^^^^^^^^^^^^^^
+
+..  autoclass:: nni.compression.pytorch.compressor.PrunerModuleWrapper
+    :members:
+
+
+..  autoclass:: nni.compression.pytorch.compressor.QuantizerModuleWrapper
+    :members:
+
+Weight Masker
+^^^^^^^^^^^^^
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.weight_masker.WeightMasker
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.structured_pruning_masker.StructuredWeightMasker
+    :members:
+
+
+Pruners
+^^^^^^^
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.sensitivity_pruner.SensitivityPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot_pruner.OneshotPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot_pruner.LevelPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot_pruner.L1FilterPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot_pruner.L2FilterPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.one_shot_pruner.FPGMPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.iterative_pruner.IterativePruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.iterative_pruner.SlimPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.iterative_pruner.TaylorFOWeightFilterPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.iterative_pruner.ActivationAPoZRankFilterPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.iterative_pruner.ActivationMeanRankFilterPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.iterative_pruner.AGPPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.iterative_pruner.ADMMPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.auto_compress_pruner.AutoCompressPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.net_adapt_pruner.NetAdaptPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.simulated_annealing_pruner.SimulatedAnnealingPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.lottery_ticket.LotteryTicketPruner
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.pruning.transformer_pruner.TransformerHeadPruner
+    :members:
+
+Quantizers
+^^^^^^^^^^
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.NaiveQuantizer
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.QAT_Quantizer
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.DoReFaQuantizer
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.BNNQuantizer
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.LsqQuantizer
+    :members:
+
+..  autoclass:: nni.algorithms.compression.pytorch.quantization.ObserverQuantizer
+    :members:
+
+Model Speedup
+-------------
+
+Quantization Speedup
+^^^^^^^^^^^^^^^^^^^^
+
+..  autoclass:: nni.compression.pytorch.quantization_speedup.backend.BaseModelSpeedup
+    :members:
+
+..  autoclass:: nni.compression.pytorch.quantization_speedup.integrated_tensorrt.ModelSpeedupTensorRT
+    :members:
+
+..  autoclass:: nni.compression.pytorch.quantization_speedup.calibrator.Calibrator
+    :members:
+
+
+Compression Utilities
+---------------------
+
+Sensitivity Utilities
+^^^^^^^^^^^^^^^^^^^^^
+
+..  autoclass:: nni.compression.pytorch.utils.sensitivity_analysis.SensitivityAnalysis
+    :members:
+
+Topology Utilities
+^^^^^^^^^^^^^^^^^^
+
+..  autoclass:: nni.compression.pytorch.utils.shape_dependency.ChannelDependency
+    :members:
+
+..  autoclass:: nni.compression.pytorch.utils.shape_dependency.GroupDependency
+    :members:
+
+..  autoclass:: nni.compression.pytorch.utils.mask_conflict.GroupMaskConflict
+    :members:
+
+..  autoclass:: nni.compression.pytorch.utils.mask_conflict.ChannelMaskConflict
+    :members:
+
+Model FLOPs/Parameters Counter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+..  autofunction:: nni.compression.pytorch.utils.counter.count_flops_params
--- a/docs/en_US/Compression/CompressionUtils.rst
+++ b/docs/en_US/Compression/CompressionUtils.rst
+Analysis Utils for Model Compression
+====================================
+
+.. contents::
+
+We provide several easy-to-use tools for users to analyze their model during model compression.
+
+Sensitivity Analysis
+--------------------
+
+First, we provide a sensitivity analysis tool (\ **SensitivityAnalysis**\ ) for users to analyze the sensitivity of each convolutional layer in their model. Specifically, SensitivityAnalysis gradually prunes each layer of the model and tests the accuracy of the model at each step. Note that SensitivityAnalysis prunes only one layer at a time, while the other layers keep their original weights. From the accuracies of the different convolutional layers under different sparsities, we can easily find out which layers the model accuracy is most sensitive to.
+
+Usage
+^^^^^
+
+The following code shows the basic usage of SensitivityAnalysis.
+
+.. code-block:: python
+
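+   import os
+
+   import torch
+   from nni.compression.pytorch.utils.sensitivity_analysis import SensitivityAnalysis
+
+   # `net`, `val_loader`, `outdir` and `filename` are assumed to be defined by the user.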
+
+   def val(model):
+       model.eval()
+       total = 0
+       correct = 0
+       with torch.no_grad():
+           for batchid, (data, label) in enumerate(val_loader):
+               data, label = data.cuda(), label.cuda()
+               out = model(data)
+               _, predicted = out.max(1)
+               total += data.size(0)
+               correct += predicted.eq(label).sum().item()
+       return correct / total
+
+   s_analyzer = SensitivityAnalysis(model=net, val_func=val)
+   sensitivity = s_analyzer.analysis(val_args=[net])
+   os.makedirs(outdir, exist_ok=True)
+   s_analyzer.export(os.path.join(outdir, filename))
+
+The two key parameters of SensitivityAnalysis are ``model`` and ``val_func``. ``model`` is the neural network to be analyzed, and ``val_func`` is the validation function that returns the model accuracy, loss, or another metric on the validation dataset. Because different scenarios may calculate the loss/accuracy in different ways, users should prepare a function that returns the model accuracy/loss on the dataset and pass it to SensitivityAnalysis.
+SensitivityAnalysis can export the sensitivity results as a CSV file; the usage is shown in the example above.
+
+Furthermore, users can specify the sparsity values used to prune each layer through the optional parameter ``sparsities``.
+
+.. code-block:: python
+
+   s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75])
+
+SensitivityAnalysis will then prune 25%, 50% and 75% of the weights of each layer in turn and record the model's accuracy each time (SensitivityAnalysis prunes only one layer at a time; the other layers keep their original weights). If ``sparsities`` is not set, SensitivityAnalysis uses ``numpy.arange(0.1, 1.0, 0.1)`` as the default sparsity values.
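
As a quick check of the default grid mentioned above, the following sketch (which assumes only that NumPy is installed) lists the values produced by ``numpy.arange(0.1, 1.0, 0.1)``. The rounding is for display only, since the floating-point steps are not exact.

```python
import numpy as np

# Default sparsity grid described above: start 0.1, stop 1.0 (exclusive), step 0.1.
default_sparsities = np.arange(0.1, 1.0, 0.1)

# Round for display only; the raw values carry small floating-point errors.
rounded = [round(float(s), 2) for s in default_sparsities]
print(rounded)
```

The stop value is exclusive, so a sparsity of 1.0 (pruning the whole layer) is not part of the grid.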
+
+Users can also speed up sensitivity analysis with the ``early_stop_mode`` and ``early_stop_value`` options. By default, SensitivityAnalysis tests the accuracy under all sparsities for each layer. When ``early_stop_mode`` and ``early_stop_value`` are set, the analysis of a layer stops as soon as the accuracy/loss meets the threshold set by ``early_stop_value``. Four early stop modes are supported: minimize, maximize, dropped, raised.
+
+minimize: The analysis stops when the validation metric returned by ``val_func`` is lower than ``early_stop_value``.
+
+maximize: The analysis stops when the validation metric returned by ``val_func`` is larger than ``early_stop_value``.
+
+dropped: The analysis stops when the validation metric has dropped by ``early_stop_value``.
+
+raised: The analysis stops when the validation metric has risen by ``early_stop_value``.
+
+.. code-block:: python
+
+   s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75], early_stop_mode='dropped', early_stop_value=0.1)
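
The four early stop modes can be pictured with the small sketch below. ``should_stop`` is a hypothetical helper written for illustration, not part of NNI; it assumes that "dropped" and "raised" are measured as absolute differences from the unpruned baseline metric, and the accuracy values are made up.

```python
def should_stop(mode, original, current, threshold):
    """Hypothetical check mirroring the four early stop modes described above."""
    if mode == "minimize":
        return current < threshold
    if mode == "maximize":
        return current > threshold
    if mode == "dropped":
        # Assumption: an absolute drop relative to the unpruned baseline.
        return original - current >= threshold
    if mode == "raised":
        # Assumption: an absolute rise relative to the unpruned baseline.
        return current - original >= threshold
    raise ValueError(f"unknown early stop mode: {mode}")


baseline = 0.55  # made-up accuracy of the unpruned model
stopped_at = None
for sparsity, accuracy in [(0.25, 0.53), (0.5, 0.48), (0.75, 0.30)]:
    if should_stop("dropped", baseline, accuracy, 0.1):
        stopped_at = sparsity
        break
print(stopped_at)
```

With these made-up numbers, the drop first reaches 0.1 at the last sparsity, so the loop stops there and any further sparsities would be skipped for this layer.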
+
+If users only want to analyze several specified convolutional layers, they can specify the target conv layers through ``specified_layers`` in the analysis function. ``specified_layers`` is a list of the PyTorch module names of the conv layers. For example:
+
+.. code-block:: python
+
+   sensitivity = s_analyzer.analysis(val_args=[net], specified_layers=['Conv1'])
+
+In this example, only the ``Conv1`` layer is analyzed. In addition, users can easily parallelize the analysis by launching multiple processes and assigning different conv layers of the same model to each process.
+
+Output example
+^^^^^^^^^^^^^^
+
+The following lines are an example CSV file exported from SensitivityAnalysis. The first line consists of 'layername' and the list of sparsities. Here, the sparsity value indicates how much of each layer's weights SensitivityAnalysis prunes. Each following line records the model accuracy when the given layer is pruned to the different sparsities. Note that, due to the early_stop option, some layers may
+not have model accuracies/losses under all sparsities, for example when the accuracy drop has already exceeded the threshold set by the user.
+
+.. code-block:: bash
+
+   layername,0.05,0.1,0.2,0.3,0.4,0.5,0.7,0.85,0.95
+   features.0,0.54566,0.46308,0.06978,0.0374,0.03024,0.01512,0.00866,0.00492,0.00184
+   features.3,0.54878,0.51184,0.37978,0.19814,0.07178,0.02114,0.00438,0.00442,0.00142
+   features.6,0.55128,0.53566,0.4887,0.4167,0.31178,0.19152,0.08612,0.01258,0.00236
+   features.8,0.55696,0.54194,0.48892,0.42986,0.33048,0.2266,0.09566,0.02348,0.0056
+   features.10,0.55468,0.5394,0.49576,0.4291,0.3591,0.28138,0.14256,0.05446,0.01578
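
One way to use such a file is to look up, for each layer, the largest sparsity that keeps the metric above a chosen floor. The sketch below uses only the Python standard library on the rows shown above; ``max_safe_sparsity`` and the 0.5 floor are hypothetical, and the handling of short or blank rows is an assumption about how early-stopped layers appear in the file.

```python
import csv
import io

# The example rows shown above.
CSV_TEXT = """layername,0.05,0.1,0.2,0.3,0.4,0.5,0.7,0.85,0.95
features.0,0.54566,0.46308,0.06978,0.0374,0.03024,0.01512,0.00866,0.00492,0.00184
features.3,0.54878,0.51184,0.37978,0.19814,0.07178,0.02114,0.00438,0.00442,0.00142
features.6,0.55128,0.53566,0.4887,0.4167,0.31178,0.19152,0.08612,0.01258,0.00236
features.8,0.55696,0.54194,0.48892,0.42986,0.33048,0.2266,0.09566,0.02348,0.0056
features.10,0.55468,0.5394,0.49576,0.4291,0.3591,0.28138,0.14256,0.05446,0.01578
"""


def max_safe_sparsity(csv_text, min_accuracy):
    """Map each layer to the largest sparsity whose accuracy is >= min_accuracy."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sparsities = [float(s) for s in header[1:]]
    result = {}
    for row in reader:
        if not row:
            continue
        best = None
        # zip() tolerates rows cut short by early stopping (an assumption).
        for sparsity, value in zip(sparsities, row[1:]):
            if value and float(value) >= min_accuracy:
                best = sparsity
        result[row[0]] = best
    return result


safe = max_safe_sparsity(CSV_TEXT, min_accuracy=0.5)
print(safe)
```

For this data, ``features.0`` tolerates far less pruning than the later layers, which matches the steep accuracy drop in its row.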
+
+Topology Analysis
+-----------------
+
+We also provide several tools for the topology analysis during the model compression. These tools are to help users compress their model better. Because of the complex topology of the network, when compressing the model, users often need to spend a lot of effort to check whether the compression configuration is reasonable. So we provide these tools for topology analysis to reduce the burden on users.
+
+ChannelDependency
+^^^^^^^^^^^^^^^^^
+
+Complicated models may contain residual connections or concat operations. When pruning such models, users need to be careful about the channel-count dependencies between the convolution layers. Take the following residual block in resnet18 as an example. The output features of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` are added together, so ``layer2.0.conv2`` and ``layer2.0.downsample.0`` should have the same number of output channels, or there may be a tensor shape conflict.
+
+
+.. image:: ../../img/channel_dependency_example.jpg
+   :target: ../../img/channel_dependency_example.jpg
+   :alt: 
+ 
+
+If layers that have a channel dependency are assigned different sparsities (here we only discuss structured pruning by L1FilterPruner/L2FilterPruner), there will be a shape conflict between these layers. Even if the pruned model with masks works fine, it cannot be directly sped up into the final model that runs on the device, because there will be a shape conflict when the model tries to add/concat the outputs of these layers. This tool finds the layers that have channel-count dependencies to help users better prune their model.
+
+Usage
+"""""
+
+.. code-block:: python
+
+   import torch
+   from nni.compression.pytorch.utils.shape_dependency import ChannelDependency
+
+   # `net` is assumed to be a user-defined model that is already on the GPU.
+   data = torch.ones(1, 3, 224, 224).cuda()
+   channel_depen = ChannelDependency(net, data)
+   channel_depen.export('dependency.csv')
+
+Output Example
+""""""""""""""
+
+The following lines are an example of the output for torchvision.models.resnet18 exported by ChannelDependency. Layers on the same line have output channel dependencies with each other. For example, layer1.1.conv2, conv1, and layer1.0.conv2 depend on each other, which means that these three layers should have the same number of output channels (filters); otherwise, the model may have a shape conflict.
+
+.. code-block:: bash
+
+   Dependency Set,Convolutional Layers
+   Set 1,layer1.1.conv2,layer1.0.conv2,conv1
+   Set 2,layer1.0.conv1
+   Set 3,layer1.1.conv1
+   Set 4,layer2.0.conv1
+   Set 5,layer2.1.conv2,layer2.0.conv2,layer2.0.downsample.0
+   Set 6,layer2.1.conv1
+   Set 7,layer3.0.conv1
+   Set 8,layer3.0.downsample.0,layer3.1.conv2,layer3.0.conv2
+   Set 9,layer3.1.conv1
+   Set 10,layer4.0.conv1
+   Set 11,layer4.0.downsample.0,layer4.1.conv2,layer4.0.conv2
+   Set 12,layer4.1.conv1
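
The exported file can be read back with the standard ``csv`` module to look up which layers must share a channel count with a given layer. The following is a minimal sketch on an excerpt of the output above; ``load_dependency_sets`` and ``peers_of`` are hypothetical helpers, not NNI functions.

```python
import csv
import io

# Excerpt of the ChannelDependency output shown above.
CSV_TEXT = """Dependency Set,Convolutional Layers
Set 1,layer1.1.conv2,layer1.0.conv2,conv1
Set 2,layer1.0.conv1
Set 5,layer2.1.conv2,layer2.0.conv2,layer2.0.downsample.0
"""


def load_dependency_sets(csv_text):
    """Return {set name: [layer names]}; rows may have different lengths."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return {row[0]: row[1:] for row in reader if row}


def peers_of(groups, layer):
    """Return the other layers that share a dependency set with `layer`."""
    for layers in groups.values():
        if layer in layers:
            return [name for name in layers if name != layer]
    return []


groups = load_dependency_sets(CSV_TEXT)
peers = peers_of(groups, "layer2.0.conv2")
print(peers)
```

A layer that sits alone in its set, such as ``layer1.0.conv1``, has no peers and can be given any sparsity without causing a shape conflict.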
+
+MaskConflict
+^^^^^^^^^^^^
+
+When the masks of different layers in a model conflict (for example, when different sparsities are assigned to layers that have a channel dependency), the conflict can be fixed with MaskConflict. Specifically, MaskConflict loads the masks exported by the pruners (L1FilterPruner, etc.) and checks whether there is a mask conflict; if so, it sets the conflicting masks to the same value.
+
+.. code-block:: python
+
+   from nni.compression.pytorch.utils.mask_conflict import fix_mask_conflict
+   fixed_mask = fix_mask_conflict('./resnet18_mask', net, data)
+
+not_safe_to_prune
+^^^^^^^^^^^^^^^^^
+
+If we prune a layer whose output tensor is taken as input by a shape-constrained operation (for example, view or reshape), the pruning may not be safe. For example, consider a convolutional layer followed by a view function.
+
+.. code-block:: python
+
+   x = self.conv(x) # output shape is (batch, 1024, 3, 3)
+   x = x.view(-1, 1024)
+
+If the number of elements in the output of the pruned conv layer is not divisible by 1024 (for example, when the output shape is (batch, 500, 3, 3)), we may get a shape error. We cannot replace such a function, because it operates directly on the Tensor. Therefore, we need to be careful when pruning such layers. The function not_safe_to_prune finds all the layers followed by a shape-constrained function. Here is a usage example. If you get a shape error when running forward inference on the sped-up model, you can exclude the layers returned by not_safe_to_prune and try again.
+
+.. code-block:: python
+
+   not_safe = not_safe_to_prune(model, dummy_input)
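
The arithmetic behind the ``view`` example above can be checked without PyTorch. The sketch below uses a hypothetical helper, ``view_is_valid``, and an assumed batch size of 4 to test whether a tensor of a given shape can be reshaped with ``view(-1, 1024)``.

```python
def view_is_valid(shape, last_dim):
    """Return True if a tensor of `shape` can be reshaped with view(-1, last_dim)."""
    total = 1
    for dim in shape:
        total *= dim
    # view(-1, last_dim) needs the element count to be a multiple of last_dim.
    return total % last_dim == 0


batch = 4  # assumed batch size, for illustration only
before_pruning = view_is_valid((batch, 1024, 3, 3), 1024)
after_pruning = view_is_valid((batch, 500, 3, 3), 1024)
print(before_pruning, after_pruning)
```

The unpruned output holds 4 * 1024 * 3 * 3 elements, a multiple of 1024, whereas the pruned output holds 18000 elements, which is not.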
+
+Model FLOPs/Parameters Counter
+------------------------------
+
+We provide a model counter for calculating the model FLOPs and parameters. The counter supports normal models without masks as well as models with mask wrappers, which helps users easily check model complexity during model compression on NNI. Note that, for structured pruning, we only identify the remaining filters according to the mask and do not take the pruned input channels into account, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after Model Speedup).
+
+Two modes are supported for collecting module information. The first mode is ``default``\ , which only collects information about convolution and linear layers. The second mode is ``full``\ , which also collects information about other operations. Users can use the collected ``results`` for further analysis.
+
+Usage
+^^^^^
+
+.. code-block:: python
+
+   import torch
+   from nni.compression.pytorch.utils.counter import count_flops_params
+
+   # `model` is assumed to be a user-defined torch.nn.Module.
+   # Given input size (1, 1, 28, 28)
+   flops, params, results = count_flops_params(model, (1, 1, 28, 28))
+
+   # Given input tensor with size (1, 1, 28, 28) and switch to full mode
+   x = torch.randn(1, 1, 28, 28)
+
+   flops, params, results = count_flops_params(model, (x,), mode='full') # tuple of tensor as input
+
+   # Format output size to M (i.e., 10^6)
+   print(f'FLOPs: {flops/1e6:.3f}M,  Params: {params/1e6:.3f}M')
+   print(results)
+   # Example of the printed ``results``:
+   # {
+   # 'conv': {'flops': [60], 'params': [20], 'weight_size': [(5, 3, 1, 1)], 'input_size': [(1, 3, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']},
+   # 'conv2': {'flops': [100], 'params': [30], 'weight_size': [(5, 5, 1, 1)], 'input_size': [(1, 5, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']}
+   # }
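
The numbers in the sample ``results`` above are consistent with a simple counting rule: one operation per weight per output position for FLOPs, and weights plus one bias per output channel for parameters. The sketch below reproduces them in plain Python. This rule is inferred from the sample output only and is an assumption, not a description of NNI's implementation; ``conv2d_counts`` is a hypothetical helper that ignores groups and assumes a batch size of 1.

```python
def conv2d_counts(weight_size, output_size, bias=True):
    """Estimate (flops, params) for a Conv2d layer under the assumed counting rule."""
    out_ch, in_ch, k_h, k_w = weight_size
    _, _, out_h, out_w = output_size
    weights = out_ch * in_ch * k_h * k_w
    flops = weights * out_h * out_w  # every weight is used once per output position
    params = weights + (out_ch if bias else 0)
    return flops, params


conv_counts = conv2d_counts((5, 3, 1, 1), (1, 5, 2, 2))
conv2_counts = conv2d_counts((5, 5, 1, 1), (1, 5, 2, 2))
print(conv_counts, conv2_counts)
```

For the two layers in the sample, this gives 60 FLOPs and 20 parameters for ``conv``, and 100 FLOPs and 30 parameters for ``conv2``.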
--- a/docs/en_US/Compression/CustomizeCompressor.rst
+++ b/docs/en_US/Compression/CustomizeCompressor.rst
+Customize New Compression Algorithm
+===================================
+
+.. contents::
+
+To simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface that covers pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then how to customize a new quantization algorithm.
+
+**Important Note** To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. See `Framework overview of model compression <../Compression/Framework.rst>`__.
+
+Customize a new pruning algorithm
+---------------------------------
+
+Implementing a new pruning algorithm requires implementing a ``weight masker`` class, which should be a subclass of ``WeightMasker``\ , and a ``pruner`` class, which should be a subclass of ``Pruner``.
+
+An implementation of ``weight masker`` may look like this:
+
+.. code-block:: python
+
+   class MyMasker(WeightMasker):
+       def __init__(self, model, pruner):
+           super().__init__(model, pruner)
+           # You can do some initialization here, such as collecting some statistics data
+           # if it is necessary for your algorithms to calculate the masks.
+
+       def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
+           # calculate the masks based on the wrapper.weight, and sparsity, 
+           # and anything else
+           # mask = ...
+           return {'weight_mask': mask}
+
+You can refer to the :githublink:`weight masker <nni/algorithms/compression/pytorch/pruning/structured_pruning_masker.py>` implementations provided by NNI when implementing your own weight masker.
+
+A basic ``pruner`` looks like this:
+
+.. code-block:: python
+
+   class MyPruner(Pruner):
+       def __init__(self, model, config_list, optimizer):
+           super().__init__(model, config_list, optimizer)
+           self.set_wrappers_attribute("if_calculated", False)
+           # construct a weight masker instance
+           self.masker = MyMasker(model, self)
+
+       def calc_mask(self, wrapper, wrapper_idx=None):
+           sparsity = wrapper.config['sparsity']
+           if wrapper.if_calculated:
+               # Already pruned, do not prune again as a one-shot pruner
+               return None
+           else:
+               # call your masker to actually calculate the mask for this layer
+               masks = self.masker.calc_mask(sparsity=sparsity, wrapper=wrapper, wrapper_idx=wrapper_idx)
+               wrapper.if_calculated = True
+               return masks
+
+Refer to the :githublink:`pruner <nni/algorithms/compression/pytorch/pruning/one_shot_pruner.py>` implementations provided by NNI when implementing your own pruner class.
+
+----
+
+Customize a new quantization algorithm
+--------------------------------------
+
+To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``. Then, override the member functions with the logic of your algorithm. The member function to override is ``quantize_weight``. ``quantize_weight`` directly returns the quantized weights rather than mask, because for quantization the quantized weights cannot be obtained by applying mask.
+
+.. code-block:: python
+
+   from nni.compression.pytorch import Quantizer
+
+   class YourQuantizer(Quantizer):
+       def __init__(self, model, config_list):
+           """
+           Suggest you to use the NNI defined spec for config
+           """
+           super().__init__(model, config_list)
+
+       def quantize_weight(self, weight, config, **kwargs):
+           """
+           quantize should overload this method to quantize weight tensors.
+           This method is effectively hooked to :meth:`forward` of the model.
+
+           Parameters
+           ----------
+           weight : Tensor
+               weight that needs to be quantized
+           config : dict
+               the configuration for weight quantization
+           """
+
+           # Put your code to generate `new_weight` here
+
+           return new_weight
+
+       def quantize_output(self, output, config, **kwargs):
+           """
+           quantize should overload this method to quantize output.
+           This method is effectively hooked to :meth:`forward` of the model.
+
+           Parameters
+           ----------
+           output : Tensor
+               output that needs to be quantized
+           config : dict
+               the configuration for output quantization
+           """
+
+           # Put your code to generate `new_output` here
+
+           return new_output
+
+       def quantize_input(self, *inputs, config, **kwargs):
+           """
+           quantize should overload this method to quantize input.
+           This method is effectively hooked to :meth:`forward` of the model.
+
+           Parameters
+           ----------
+           inputs : Tensor
+               inputs that needs to be quantized
+           config : dict
+               the configuration for inputs quantization
+           """
+
+           # Put your code to generate `new_input` here
+
+           return new_input
+
+       def update_epoch(self, epoch_num):
+           pass
+
+       def step(self):
+           """
+           Can do some processing based on the model or weights bound
+           in the func bind_model
+           """
+           pass
+
+Customize backward function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sometimes a quantization operation needs a customized backward function, such as the `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__\ . Users can customize a backward function as follows:
+
+.. code-block:: python
+
+   import torch
+   from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
+
+   class ClipGrad(QuantGrad):
+       @staticmethod
+       def quant_backward(tensor, grad_output, quant_type):
+           """
+           This method should be overridden by subclasses to provide a customized backward function,
+           default implementation is Straight-Through Estimator
+           Parameters
+           ----------
+           tensor : Tensor
+               input of quantization operation
+           grad_output : Tensor
+               gradient of the output of quantization operation
+           quant_type : QuantType
+               the type of quantization, it can be `QuantType.INPUT`, `QuantType.WEIGHT`, `QuantType.OUTPUT`,
+               you can define different behavior for different types.
+           Returns
+           -------
+           tensor
+               gradient of the input of quantization operation
+           """
+
+           # for quant_output function, set grad to zero if the absolute value of tensor is larger than 1
+           if quant_type == QuantType.OUTPUT:
+               grad_output[torch.abs(tensor) > 1] = 0
+           return grad_output
+
+
+   class YourQuantizer(Quantizer):
+       def __init__(self, model, config_list):
+           super().__init__(model, config_list)
+           # set your customized backward function to overwrite default backward function
+           self.quant_grad = ClipGrad
+
+If you do not customize ``QuantGrad``\ , the default backward is Straight-Through Estimator. 
+*Coming Soon* ...
--- a/docs/en_US/Compression/DependencyAware.rst
+++ b/docs/en_US/Compression/DependencyAware.rst
+Dependency-aware Mode for Filter Pruning
+========================================
+
+Currently, we have several filter pruning algorithms for convolutional layers: FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, Taylor FO On Weight Pruner. In these filter pruning algorithms, the pruner prunes each convolutional layer separately. While pruning a convolutional layer, the algorithm quantifies the importance of each filter based on some specific rule (such as the L1 norm) and prunes the less important filters.
+
+As the `dependency analysis utils <./CompressionUtils.rst>`__ show, if the output channels of two convolutional layers (conv1, conv2) are added together, then these two layers have a channel dependency with each other (for more details, please see `Compression Utils <./CompressionUtils.rst>`__\ ). Take the following figure as an example.
+
+
+.. image:: ../../img/mask_conflict.jpg
+   :target: ../../img/mask_conflict.jpg
+   :alt: 
+
+
+Suppose we prune the first 50% of the output channels (filters) of conv1 and the last 50% of the output channels of conv2. Although both layers have pruned 50% of their filters, the speedup module still needs to add zeros to align the output channels. In this case, we cannot harvest the speed benefit from the model pruning.
+
+To better gain the speed benefit of model pruning, we add a dependency-aware mode for the filter pruners. In the dependency-aware mode, the pruner prunes the model based not only on the L1 norm of each filter but also on the topology of the whole network architecture.
+
+In the dependency-aware mode (\ ``dependency_aware`` is set to ``True``\ ), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
+
+
+.. image:: ../../img/dependency-aware.jpg
+   :target: ../../img/dependency-aware.jpg
+   :alt: 
+
+
+Take the dependency-aware mode of L1Filter Pruner as an example. For each channel, the pruner calculates the sum of the L1 norms over all the layers in the dependency set. The number of channels that can actually be pruned from this dependency set is determined by the minimum sparsity of the layers in the set (denoted by ``min_sparsity``\ ). According to the L1 norm sum of each channel, the pruner prunes the same ``min_sparsity`` fraction of channels for all the layers. Next, the pruner additionally prunes ``sparsity`` - ``min_sparsity`` of the channels for each convolutional layer based on that layer's own per-channel L1 norm. For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively. In this case, the dependency-aware pruner will:
+
+* First, prune the same 20% of channels for ``conv1`` and ``conv2`` according to the L1 norm sum of ``conv1`` and ``conv2``.
+* Second, additionally prune 10% of the channels for ``conv1`` according to the L1 norm of each channel of ``conv1``.
+
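+The two-step selection described above can be sketched in plain Python. This is an illustrative sketch, not the NNI implementation: ``dependency_aware_prune`` is a hypothetical helper, the per-channel norm values are made up, and obtaining the channel count by rounding ``sparsity * num_channels`` is an assumption.
+
+.. code-block:: python
+
+   def dependency_aware_prune(l1_norms, sparsities):
+       """Return {layer: sorted pruned channel indices} for one dependency set.
+
+       l1_norms:   {layer: [L1 norm of each output channel]}, equal lengths
+       sparsities: {layer: configured sparsity in [0, 1]}
+       """
+       layers = list(l1_norms)
+       num_channels = len(l1_norms[layers[0]])
+       min_sparsity = min(sparsities[name] for name in layers)
+
+       # Step 1: rank channels by the L1 norm summed over all layers in the
+       # set, and prune the same lowest-ranked channels in every layer.
+       norm_sum = [sum(l1_norms[name][c] for name in layers)
+                   for c in range(num_channels)]
+       n_common = round(num_channels * min_sparsity)
+       common = sorted(range(num_channels), key=lambda c: norm_sum[c])[:n_common]
+
+       # Step 2: each layer prunes its remaining budget by its own L1 norm.
+       pruned = {}
+       for name in layers:
+           n_total = round(num_channels * sparsities[name])
+           remaining = [c for c in range(num_channels) if c not in common]
+           remaining.sort(key=lambda c: l1_norms[name][c])
+           pruned[name] = sorted(common + remaining[:n_total - n_common])
+       return pruned
+
+   # Made-up per-channel norms for two layers whose outputs are added together
+   norms = {
+       'conv1': [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 1.0],
+       'conv2': [0.8, 0.2, 0.9, 0.1, 0.6, 0.9, 0.7, 0.5, 0.4, 1.0],
+   }
+   pruned = dependency_aware_prune(norms, {'conv1': 0.3, 'conv2': 0.2})
+
+With these numbers, both layers share the two channels with the smallest summed norm, and only ``conv1`` prunes one extra channel chosen by its own norms.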
+
+In addition, for convolutional layers that have more than one filter group, the dependency-aware pruner will also try to prune the same number of channels for each filter group. Overall, this pruner prunes the model according to the L1 norm of each filter and tries to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
+
+In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
+
+Usage
+-----
+
+In this section, we show how to enable the dependency-aware mode for the filter pruners. Currently, only the one-shot pruners, such as FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, and Taylor FO On Weight Pruner, support the dependency-aware mode.
+
+To enable the dependency-aware mode for ``L1FilterPruner``\ :
+
+.. code-block:: python
+
+   import torch
+   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
+   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
+   # dummy_input is necessary for the dependency_aware mode
+   dummy_input = torch.ones(1, 3, 224, 224).cuda()
+   pruner = L1FilterPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
+   # for L2FilterPruner
+   # pruner = L2FilterPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
+   # for FPGMPruner
+   # pruner = FPGMPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
+   # for ActivationAPoZRankFilterPruner
+   # pruner = ActivationAPoZRankFilterPruner(model, config_list, optimizer, trainer, criterion, sparsifying_training_batches=1, dependency_aware=True, dummy_input=dummy_input)
+   # for ActivationMeanRankFilterPruner
+   # pruner = ActivationMeanRankFilterPruner(model, config_list, optimizer, trainer, criterion, sparsifying_training_batches=1, dependency_aware=True, dummy_input=dummy_input)
+   # for TaylorFOWeightFilterPruner
+   # pruner = TaylorFOWeightFilterPruner(model, config_list, optimizer, trainer, criterion, sparsifying_training_batches=1, dependency_aware=True, dummy_input=dummy_input)
+
+   pruner.compress()
+
+Evaluation
+----------
+
+To compare the performance of the pruner with and without the dependency-aware mode, we use L1FilterPruner to prune Mobilenet_v2 with the dependency-aware mode turned on and off respectively. To simplify the experiment, we use uniform pruning, which means we allocate the same sparsity to all convolutional layers in the model.
+We trained a Mobilenet_v2 model on the cifar10 dataset and pruned the model based on this pretrained checkpoint. The following figure shows the accuracy and FLOPs of the model pruned by different pruners.
+
+
+.. image:: ../../img/mobilev2_l1_cifar.jpg
+   :target: ../../img/mobilev2_l1_cifar.jpg
+   :alt: 
+
+
+In the figure, ``Dependency-aware`` represents the L1FilterPruner with the dependency-aware mode enabled, ``L1 Filter`` is the normal ``L1FilterPruner`` without the dependency-aware mode, and ``No-Dependency`` means the pruner only prunes the layers that have no channel dependency with other layers. As the figure shows, when the dependency-aware mode is enabled, the pruner achieves higher accuracy under the same FLOPs.
--- a/docs/en_US/Compression/Framework.rst
+++ b/docs/en_US/Compression/Framework.rst
+Framework overview of model compression
+=======================================
+
+.. contents::
+
+The picture below shows an overview of the components of the model compression framework.
+
+
+.. image:: ../../img/compressor_framework.jpg
+   :target: ../../img/compressor_framework.jpg
+   :alt: 
+
+
+There are 3 major components/classes in NNI model compression framework: ``Compressor``\ , ``Pruner`` and ``Quantizer``. Let's look at them in detail one by one:
+
+Compressor
+----------
+
+Compressor is the base class for pruner and quantizer. It provides a unified interface to end users, so that pruners and quantizers can be used in the same way. For example, to use a pruner:
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.pruning import LevelPruner
+
+   # load a pretrained model or train a model before using a pruner
+
+   configure_list = [{
+       'sparsity': 0.7,
+       'op_types': ['Conv2d', 'Linear'],
+   }]
+
+   pruner = LevelPruner(model, configure_list)
+   model = pruner.compress()
+
+   # the model is now wrapped for pruning; start fine-tuning it,
+   # and the model will be pruned during training automatically
+
+To use a quantizer:
+
+.. code-block:: python
+
+   import torch
+   from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
+
+   configure_list = [{
+       'quant_types': ['weight'],
+       'quant_bits': {
+           'weight': 8,
+       },
+       'op_types':['Conv2d', 'Linear']
+   }]
+   optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
+   quantizer = DoReFaQuantizer(model, configure_list, optimizer)
+   quantizer.compress()
+
+View :githublink:`example code <examples/model_compress>` for more information.
+
+The ``Compressor`` class provides some utility methods for subclasses and users:
+
+Set wrapper attribute
+^^^^^^^^^^^^^^^^^^^^^
+
+Sometimes ``calc_mask`` must save some state data. Users can use the ``set_wrappers_attribute`` API to register attributes, just like how buffers are registered in PyTorch modules. These buffers are registered to the ``module wrapper``, and users can access them through the ``module wrapper``.
+For example, ``set_wrappers_attribute`` can set a buffer ``if_calculated`` that serves as a flag indicating whether the mask of a layer has already been calculated.
+
+Collect data during forward
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sometimes users want to collect some data during the modules' forward method, for example, the mean value of the activation. This can be done by adding a customized collector to the module.
+
+.. code-block:: python
+
+   class MyMasker(WeightMasker):
+       def __init__(self, model, pruner):
+           super().__init__(model, pruner)
+           # Set attribute `collected_activation` for all wrappers to store
+           # activations for each layer
+           self.pruner.set_wrappers_attribute("collected_activation", [])
+           self.activation = torch.nn.functional.relu
+
+           def collector(wrapper, input_, output):
+               # The collected activation can be accessed via each wrapper's collected_activation
+               # attribute
+               wrapper.collected_activation.append(self.activation(output.detach().cpu()))
+
+           self.pruner.hook_id = self.pruner.add_activation_collector(collector)
+
+The collector function will be called each time the forward method runs.
+
+Users can also remove this collector like this:
+
+.. code-block:: python
+
+   # Save the collector identifier
+   collector_id = self.pruner.add_activation_collector(collector)
+
+   # When the collector is no longer needed, it can be removed using
+   # the saved collector identifier
+   self.pruner.remove_activation_collector(collector_id)
+
+----
+
+Pruner
+------
+
+A pruner receives ``model`` and ``config_list`` as arguments.
+Some pruners like ``TaylorFOWeightFilter Pruner`` prune the model according to the ``config_list`` during the training loop by adding a hook on ``optimizer.step()``.
+
+The Pruner class is a subclass of Compressor, so it contains everything in the Compressor class plus some additional components used only for pruning. It contains:
+
+Weight masker
+^^^^^^^^^^^^^
+
+A ``weight masker`` is the implementation of a pruning algorithm. It can prune a specified layer wrapped by a ``module wrapper`` with a specified sparsity.
+
+Pruning module wrapper
+^^^^^^^^^^^^^^^^^^^^^^
+
+A ``pruning module wrapper`` is a module containing:
+
+
+#. the original module
+#. some buffers used by ``calc_mask``
+#. a new forward method that applies masks before running the original forward method.
+
+The reasons to use a ``module wrapper`` are:
+
+
+#. some buffers are needed by ``calc_mask`` to calculate masks and these buffers should be registered in ``module wrapper`` so that the original modules are not contaminated.
+#. a new ``forward`` method is needed to apply masks to weight before calling the real ``forward`` method.
+
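+The wrapper idea can be sketched without PyTorch. The following is a simplified, framework-free illustration, not NNI's ``PrunerModuleWrapper``: ``ToyLinear`` and ``ToyPrunerWrapper`` are hypothetical classes, and restoring the unmasked weight after the forward call is a choice made for this sketch only.
+
+.. code-block:: python
+
+   class ToyLinear:
+       """Stand-in for a framework layer: y = sum(w_i * x_i)."""
+
+       def __init__(self, weight):
+           self.weight = list(weight)
+
+       def forward(self, x):
+           return sum(w * xi for w, xi in zip(self.weight, x))
+
+
+   class ToyPrunerWrapper:
+       """Holds the original module plus a mask buffer, and masks on forward."""
+
+       def __init__(self, module):
+           self.module = module
+           # The buffer lives on the wrapper, so the original module
+           # gets no extra attributes.
+           self.weight_mask = [1.0] * len(module.weight)
+
+       def forward(self, x):
+           original = self.module.weight
+           # Apply the mask to the weight before calling the real forward
+           self.module.weight = [w * m for w, m in zip(original, self.weight_mask)]
+           try:
+               return self.module.forward(x)
+           finally:
+               self.module.weight = original
+
+
+   layer = ToyLinear([1.0, 2.0, 3.0])
+   wrapper = ToyPrunerWrapper(layer)
+   wrapper.weight_mask = [1.0, 0.0, 1.0]   # prune the second weight
+   masked_output = wrapper.forward([1.0, 1.0, 1.0])
+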
+Pruning hook
+^^^^^^^^^^^^
+
+A pruning hook is installed on a pruner when the pruner is constructed. It is used to call the pruner's ``calc_mask`` method when ``optimizer.step()`` is invoked.
+
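+One way to picture such a hook is as a patched ``step`` method. The sketch below is framework-free and is not NNI's actual patching code: ``ToyOptimizer`` and ``patch_step`` are hypothetical names, and running the callback after the original step (rather than before it) is an assumption.
+
+.. code-block:: python
+
+   class ToyOptimizer:
+       """Stand-in for an optimizer; only counts how often step() ran."""
+
+       def __init__(self):
+           self.steps = 0
+
+       def step(self):
+           self.steps += 1
+
+
+   def patch_step(optimizer, callback):
+       """Replace optimizer.step so that callback() runs after every step."""
+       original_step = optimizer.step
+
+       def patched_step(*args, **kwargs):
+           result = original_step(*args, **kwargs)
+           callback()   # e.g. recompute the masks here
+           return result
+
+       optimizer.step = patched_step
+
+
+   calls = []
+   optimizer = ToyOptimizer()
+   patch_step(optimizer, lambda: calls.append('calc_mask'))
+   optimizer.step()
+   optimizer.step()
+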
+----
+
+Quantizer
+---------
+
+The Quantizer class is also a subclass of ``Compressor``\ . It is used to compress models by reducing the number of bits required to represent weights or activations, which can reduce the computation and the inference time. It contains:
+
+Quantization module wrapper
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Each module/layer of the model to be quantized is wrapped by a quantization module wrapper, which provides a new ``forward`` method to quantize the original module's weight, input and output.
+
+Quantization hook
+^^^^^^^^^^^^^^^^^
+
+A quantization hook is installed on a quantizer when it is constructed. It is called at ``optimizer.step()``.
+
+Quantization methods
+^^^^^^^^^^^^^^^^^^^^
+
+The ``Quantizer`` class provides the following methods for subclasses to implement quantization algorithms:
+
+.. code-block:: python
+
+   class Quantizer(Compressor):
+       """
+       Base quantizer for pytorch quantizer
+       """
+       def quantize_weight(self, weight, wrapper, **kwargs):
+           """
+           A quantizer should override this method to quantize the weight.
+           This method is effectively hooked to :meth:`forward` of the model.
+           Parameters
+           ----------
+           weight : Tensor
+               weight that needs to be quantized
+           wrapper : QuantizerModuleWrapper
+               the wrapper for the original module
+           """
+           raise NotImplementedError('Quantizer must overload quantize_weight()')
+
+       def quantize_output(self, output, wrapper, **kwargs):
+           """
+           A quantizer should override this method to quantize the output.
+           This method is effectively hooked to :meth:`forward` of the model.
+           Parameters
+           ----------
+           output : Tensor
+               output that needs to be quantized
+           wrapper : QuantizerModuleWrapper
+               the wrapper for the original module
+           """
+           raise NotImplementedError('Quantizer must overload quantize_output()')
+
+       def quantize_input(self, *inputs, wrapper, **kwargs):
+           """
+           A quantizer should override this method to quantize the input.
+           This method is effectively hooked to :meth:`forward` of the model.
+           Parameters
+           ----------
+           inputs : Tensor
+               inputs that need to be quantized
+           wrapper : QuantizerModuleWrapper
+               the wrapper for the original module
+           """
+           raise NotImplementedError('Quantizer must overload quantize_input()')
+
+----
+
+Multi-GPU support
+-----------------
+
+In multi-GPU training, buffers and parameters are copied to multiple GPUs every time the ``forward`` method runs on multiple GPUs. If buffers and parameters are updated in the ``forward`` method, an ``in-place`` update is needed to ensure the update is effective.
+Since ``calc_mask`` is called in the ``optimizer.step`` method, which happens after the ``forward`` method and only on one GPU, it supports multi-GPU naturally.
--- a/docs/en_US/Compression/ModelSpeedup.rst
+++ b/docs/en_US/Compression/ModelSpeedup.rst
+Speed up Masked Model
+=====================
+
+*This feature is in Beta version.*
+
+Introduction
+------------
+
+Pruning algorithms usually use weight masks to simulate the real pruning. Masks can be used
+to check model performance of a specific pruning (or sparsity), but there is no real speedup.
+Since model speedup is the ultimate goal of model pruning, we try to provide a tool to users
+to convert a model to a smaller one based on user provided masks (the masks come from the
+pruning algorithms).
+
+There are two types of pruning. One is fine-grained pruning, which does not change the shape of the weights or of the input/output tensors. A sparse kernel is required to speed up a fine-grained pruned layer. The other is coarse-grained pruning (e.g., of channels), where the shape of the weights and of the input/output tensors usually changes. To speed up this kind of pruning, there is no need for a sparse kernel; the pruned layer is simply replaced with a smaller one. Since the community support for sparse kernels is limited, we only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning for the future.
+
+Design and Implementation
+-------------------------
+
+To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of the weights or of the input/output tensors, so we should do shape inference to check whether other unpruned layers also need to be replaced because of the shape change. Therefore, in our design, there are two main steps: first, do shape inference to find all the modules that should be replaced; second, replace the modules. The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
+
+For each module, we should prepare four functions, three for shape inference and one for module replacement. The three shape inference functions are: given weight shape infer input/output shape, given input shape infer weight/output shape, given output shape infer weight/input shape. The module replacement function returns a newly created module which is smaller.
+
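+The shape propagation can be pictured with two stacked convolutions. The sketch below is a framework-free illustration under simplifying assumptions, not the ``ModelSpeedup`` implementation: ``kept_channels`` and ``infer_shapes`` are hypothetical helpers, weight shapes follow the ``(out_channels, in_channels, kH, kW)`` layout, and ``conv2`` is assumed to consume the output of ``conv1`` directly.
+
+.. code-block:: python
+
+   def kept_channels(channel_mask):
+       """Indices of the output channels that survive a coarse-grained mask."""
+       return [i for i, keep in enumerate(channel_mask) if keep]
+
+
+   def infer_shapes(conv1_shape, conv2_shape, conv1_out_mask):
+       """Infer the smaller weight shapes after pruning conv1's output channels.
+
+       Pruning output channels of conv1 shrinks its own weight and also the
+       input-channel dimension of the layer that consumes its output.
+       """
+       n_kept = len(kept_channels(conv1_out_mask))
+       new_conv1 = (n_kept,) + tuple(conv1_shape[1:])
+       new_conv2 = (conv2_shape[0], n_kept) + tuple(conv2_shape[2:])
+       return new_conv1, new_conv2
+
+
+   # conv1: 3 -> 8 channels, conv2: 8 -> 16 channels, 3 channels of conv1 masked
+   mask = [1, 0, 1, 0, 1, 1, 0, 1]
+   new_conv1, new_conv2 = infer_shapes((8, 3, 3, 3), (16, 8, 3, 3), mask)
+
+Here ``conv2`` itself is not pruned, yet it must still be replaced because its input shape changed, which is why shape inference has to run over the whole graph.
+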
+Usage
+-----
+
+.. code-block:: python
+
+   import time
+
+   from nni.compression.pytorch import ModelSpeedup
+   # model: the model you want to speed up
+   # dummy_input: dummy input of the model, given to `jit.trace`
+   # masks_file: the mask file created by pruning algorithms
+   m_speedup = ModelSpeedup(model, dummy_input.to(device), masks_file)
+   m_speedup.speedup_model()
+   dummy_input = dummy_input.to(device)
+   start = time.time()
+   out = model(dummy_input)
+   print('elapsed time: ', time.time() - start)
+
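+A single timed call can be noisy. A slightly more robust measurement, sketched here with only the standard library, runs a few warm-up calls and averages over several repetitions. ``measure_latency`` is a hypothetical helper, not part of NNI. On a GPU, kernels run asynchronously, so the device should also be synchronized (for example with ``torch.cuda.synchronize()``) before reading the clock; that step is omitted in this framework-free sketch.
+
+.. code-block:: python
+
+   import time
+
+   def measure_latency(fn, warmup=2, repeats=10):
+       """Return the mean wall-clock time in seconds of fn() over `repeats` runs."""
+       for _ in range(warmup):
+           fn()   # warm-up runs are not timed
+       start = time.perf_counter()
+       for _ in range(repeats):
+           fn()
+       return (time.perf_counter() - start) / repeats
+
+   # Any callable works; with a real model this would be `lambda: model(dummy_input)`
+   latency = measure_latency(lambda: sum(range(10000)))
+   print('mean latency: %.6f s' % latency)
+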
+For complete examples, please refer to :githublink:`the code <examples/model_compress/pruning/speedup/model_speedup.py>`.
+
+NOTE: The current implementation supports PyTorch 1.3.1 or newer.
+
+Limitations
+-----------
+
+Since every module requires four functions for shape inference and module replacement, which is a large amount of work, we have only implemented the ones required by the examples. If you want to speed up your own model and it is not supported by the current implementation, you are welcome to contribute.
+
+For PyTorch we can only replace modules; if functions in ``forward`` need to be replaced, our current implementation does not work. One workaround is to make the function a PyTorch module.
+
+Speedup Results of Examples
+---------------------------
+
+The code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.
+
+slim pruner example
+^^^^^^^^^^^^^^^^^^^
+
+on one V100 GPU,
+input tensor: ``torch.randn(64, 3, 32, 32)``
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Times
+     - Mask Latency
+     - Speedup Latency
+   * - 1
+     - 0.01197
+     - 0.005107
+   * - 2
+     - 0.02019
+     - 0.008769
+   * - 4
+     - 0.02733
+     - 0.014809
+   * - 8
+     - 0.04310
+     - 0.027441
+   * - 16
+     - 0.07731
+     - 0.05008
+   * - 32
+     - 0.14464
+     - 0.10027
+
+
+fpgm pruner example
+^^^^^^^^^^^^^^^^^^^
+
+on CPU,
+input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
+the variance is too large
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Times
+     - Mask Latency
+     - Speedup Latency
+   * - 1
+     - 0.01383
+     - 0.01839
+   * - 2
+     - 0.01167
+     - 0.003558
+   * - 4
+     - 0.01636
+     - 0.01088
+   * - 40
+     - 0.14412
+     - 0.08268
+   * - 40
+     - 1.29385
+     - 0.14408
+   * - 40
+     - 0.41035
+     - 0.46162
+   * - 400
+     - 6.29020
+     - 5.82143
+
+
+l1filter pruner example
+^^^^^^^^^^^^^^^^^^^^^^^
+
+on one V100 GPU,
+input tensor: ``torch.randn(64, 3, 32, 32)``
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Times
+     - Mask Latency
+     - Speedup Latency
+   * - 1
+     - 0.01026
+     - 0.003677
+   * - 2
+     - 0.01657
+     - 0.008161
+   * - 4
+     - 0.02458
+     - 0.020018
+   * - 8
+     - 0.03498
+     - 0.025504
+   * - 16
+     - 0.06757
+     - 0.047523
+   * - 32
+     - 0.10487
+     - 0.086442
+
+
+APoZ pruner example
+^^^^^^^^^^^^^^^^^^^
+
+on one V100 GPU,
+input tensor: ``torch.randn(64, 3, 32, 32)``
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Times
+     - Mask Latency
+     - Speedup Latency
+   * - 1
+     - 0.01389
+     - 0.004208
+   * - 2
+     - 0.01628
+     - 0.008310
+   * - 4
+     - 0.02521
+     - 0.014008
+   * - 8
+     - 0.03386
+     - 0.023923
+   * - 16
+     - 0.06042
+     - 0.046183
+   * - 32
+     - 0.12421
+     - 0.087113
+
+
+SimulatedAnnealing pruner example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In this experiment, we use SimulatedAnnealing pruner to prune the resnet18 on the cifar10 dataset.
+We measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.
+The latency is measured on one V100 GPU and the input tensor is  ``torch.randn(128, 3, 32, 32)``.
+
+
+.. image:: ../../img/SA_latency_accuracy.png
+
+
+User configuration for ModelSpeedup
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**PyTorch**
+
+..  autoclass:: nni.compression.pytorch.ModelSpeedup
--- a/docs/en_US/Compression/Overview.rst
+++ b/docs/en_US/Compression/Overview.rst
+Model Compression with NNI
+==========================
+
+.. contents::
+
+As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for some real-time applications. Model compression can be used to address this problem.
+
+NNI provides a model compression toolkit to help users compress and speed up their models with state-of-the-art compression algorithms and strategies. There are several core features supported by NNI model compression:
+
+
+* Support many popular pruning and quantization algorithms.
+* Automate model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
+* Speed up a compressed model so that it has lower inference latency and a smaller size.
+* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
+* Concise interface for users to customize their own compression algorithms.
+
+
+Compression Pipeline
+--------------------
+
+.. image:: ../../img/compression_flow.jpg
+   :target: ../../img/compression_flow.jpg
+   :alt: 
+
+The overall compression pipeline in NNI. For compressing a pretrained model, pruning and quantization can be used alone or in combination. 
+
+.. note::
+  NNI compression algorithms only simulate compression (for example, with masks), whereas the NNI speedup tool truly compresses the model and reduces latency. To obtain a truly compact model, users should conduct `model speedup <./ModelSpeedup.rst>`__. The interface and APIs are unified for both PyTorch and TensorFlow; currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.
+
+Supported Algorithms
+--------------------
+
+The algorithms include pruning algorithms and quantization algorithms.
+
+Pruning Algorithms
+^^^^^^^^^^^^^^^^^^
+
+Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and mitigate the over-fitting issue.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Name
+     - Brief Introduction of Algorithm
+   * - `Level Pruner <Pruner.rst#level-pruner>`__
+     - Pruning the specified ratio on each weight based on absolute values of weights
+   * - `AGP Pruner <../Compression/Pruner.rst#agp-pruner>`__
+     - Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
+   * - `Lottery Ticket Pruner <../Compression/Pruner.rst#lottery-ticket-hypothesis>`__
+     - The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper <https://arxiv.org/abs/1803.03635>`__
+   * - `FPGM Pruner <../Compression/Pruner.rst#fpgm-pruner>`__
+     - Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `Reference Paper <https://arxiv.org/pdf/1811.00250.pdf>`__
+   * - `L1Filter Pruner <../Compression/Pruner.rst#l1filter-pruner>`__
+     - Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
+   * - `L2Filter Pruner <../Compression/Pruner.rst#l2filter-pruner>`__
+     - Pruning filters with the smallest L2 norm of weights in convolution layers
+   * - `ActivationAPoZRankFilterPruner <../Compression/Pruner.rst#activationapozrankfilter-pruner>`__
+     - Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. `Reference Paper <https://arxiv.org/abs/1607.03250>`__
+   * - `ActivationMeanRankFilterPruner <../Compression/Pruner.rst#activationmeanrankfilter-pruner>`__
+     - Pruning filters based on the metric that calculates the smallest mean value of output activations
+   * - `Slim Pruner <../Compression/Pruner.rst#slim-pruner>`__
+     - Pruning channels in convolution layers by pruning scaling factors in BN layers(Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
+   * - `TaylorFO Pruner <../Compression/Pruner.rst#taylorfoweightfilter-pruner>`__
+     - Pruning filters based on the first order taylor expansion on weights(Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
+   * - `ADMM Pruner <../Compression/Pruner.rst#admm-pruner>`__
+     - Pruning based on ADMM optimization technique `Reference Paper <https://arxiv.org/abs/1804.03294>`__
+   * - `NetAdapt Pruner <../Compression/Pruner.rst#netadapt-pruner>`__
+     - Automatically simplify a pretrained network to meet the resource budget by iterative pruning  `Reference Paper <https://arxiv.org/abs/1804.03230>`__
+   * - `SimulatedAnnealing Pruner <../Compression/Pruner.rst#simulatedannealing-pruner>`__
+     - Automatic pruning with a guided heuristic search method, Simulated Annealing algorithm `Reference Paper <https://arxiv.org/abs/1907.03141>`__
+   * - `AutoCompress Pruner <../Compression/Pruner.rst#autocompress-pruner>`__
+     - Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner `Reference Paper <https://arxiv.org/abs/1907.03141>`__
+   * - `AMC Pruner <../Compression/Pruner.rst#amc-pruner>`__
+     - AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/pdf/1802.03494.pdf>`__
+   * - `Transformer Head Pruner <../Compression/Pruner.rst#transformer-head-pruner>`__
+     - Pruning attention heads from transformer models either in one shot or iteratively.
+
+
+You can refer to this `benchmark <../CommunitySharings/ModelCompressionComparison.rst>`__ for the performance of these pruners on some benchmark problems.
+
+Quantization Algorithms
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computations and the inference time.
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Name
+     - Brief Introduction of Algorithm
+   * - `Naive Quantizer <../Compression/Quantizer.rst#naive-quantizer>`__
+     - Quantize weights to default 8 bits
+   * - `QAT Quantizer <../Compression/Quantizer.rst#qat-quantizer>`__
+     - Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
+   * - `DoReFa Quantizer <../Compression/Quantizer.rst#dorefa-quantizer>`__
+     - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper <https://arxiv.org/abs/1606.06160>`__
+   * - `BNN Quantizer <../Compression/Quantizer.rst#bnn-quantizer>`__
+     - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
+   * - `LSQ Quantizer <../Compression/Quantizer.rst#lsq-quantizer>`__
+     - Learned step size quantization. `Reference Paper <https://arxiv.org/pdf/1902.08153.pdf>`__
+   * - `Observer Quantizer <../Compression/Quantizer.rst#observer-quantizer>`__
+     - Post-training quantization. Collect quantization information during calibration with observers.
+
+
+Model Speedup
+-------------
+
+The final goal of model compression is to reduce inference latency and model size. However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model, for example, using masks for pruning algorithms and storing quantized values still in float32 for quantization algorithms. Given the output masks and quantization bits produced by those algorithms, NNI can truly speed up the model. The detailed tutorial of Masked Model Speedup can be found `here <./ModelSpeedup.rst>`__. The detailed tutorial of Mixed Precision Quantization Model Speedup can be found `here <./QuantizationSpeedup.rst>`__.
+
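+To see why simulated quantization alone gives no speedup, consider the following framework-free sketch. It is an illustration under stated assumptions, not one of NNI's quantizers: ``fake_quantize`` is a hypothetical helper that uses uniform quantization over the observed min/max range. The values are snapped to ``2 ** bits`` levels, but the result is still a list of ordinary floats, so storage and arithmetic stay the same until a backend actually uses low-bit kernels.
+
+.. code-block:: python
+
+   def fake_quantize(values, bits=8):
+       """Snap values to 2**bits uniform levels and return them as floats."""
+       qmin, qmax = 0, 2 ** bits - 1
+       lo, hi = min(values), max(values)
+       # Fall back to a scale of 1.0 when all values are identical
+       scale = (hi - lo) / (qmax - qmin) or 1.0
+       result = []
+       for v in values:
+           q = round((v - lo) / scale)       # integer level
+           q = max(qmin, min(qmax, q))       # clamp to [qmin, qmax]
+           result.append(lo + q * scale)     # stored back as a float
+       return result
+
+   weights = [0.0, 0.1, 0.6, 1.0]
+   simulated = fake_quantize(weights, bits=2)   # at most 4 distinct levels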
+
+Compression Utilities
+---------------------
+
+Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users can check the sensitivity of each layer to pruning, or easily calculate the FLOPs and parameter size of a model. Please refer to `here <./CompressionUtils.rst>`__ for a complete list of compression utilities.
+
+Advanced Usage
+--------------
+
+NNI model compression provides a simple interface for users to customize a new compression algorithm. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. Users can learn more about our compression framework and customize a new compression algorithm (pruning algorithm or quantization algorithm) based on it. Moreover, users can leverage NNI's auto tuning power to automatically compress a model. Please refer to `here <./advanced.rst>`__ for more details.
+
+
+Reference and Feedback
+----------------------
+
+* To `report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.rst>`__ for this feature in GitHub;
+* To `file a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__ for this feature in GitHub;
+* To know more about `Feature Engineering with NNI <../FeatureEngineering/Overview.rst>`__\ ;
+* To know more about `NAS with NNI <../NAS/Overview.rst>`__\ ;
+* To know more about `Hyperparameter Tuning with NNI <../Tuner/BuiltinTuner.rst>`__\ ;
--- a/docs/en_US/Compression/Pruner.rst
+++ b/docs/en_US/Compression/Pruner.rst
--- a/docs/en_US/Compression/QuantizationSpeedup.rst
+++ b/docs/en_US/Compression/QuantizationSpeedup.rst
+Speed up Mixed Precision Quantization Model (experimental)
+==========================================================
+
+
+Introduction
+------------
+
+Deep learning networks are computationally intensive and memory intensive, 
+which increases the difficulty of deploying deep neural network models. Quantization is a 
+fundamental technology that is widely used to reduce the memory footprint and speed up the inference 
+process. Many frameworks have begun to support quantization, but few of them support mixed precision 
+quantization and get real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__ only support simulated mixed precision quantization, which will 
+not speed up the inference process. To get real speedup from mixed precision quantization and 
+help people get real feedback from hardware, we design a general framework with a simple interface that allows NNI quantization algorithms to connect to different 
+DL model optimization backends (e.g., TensorRT, NNFusion). This gives users an end-to-end experience: after quantizing their model 
+with quantization algorithms, the quantized model can be directly sped up with the connected optimization backend. NNI connects to 
+TensorRT at this stage, and will support more backends in the future.
+
+
+Design and Implementation
+-------------------------
+
+To support speeding up mixed precision quantization, we divide the framework into two parts: frontend and backend.
+The frontend can be a popular training framework such as PyTorch or TensorFlow. The backend can be an inference
+framework for specific hardware, such as TensorRT. At present, we support PyTorch as the frontend and
+TensorRT as the backend. To convert a PyTorch model to a TensorRT engine, we use ONNX as the intermediate graph
+representation: the PyTorch model is first converted to an ONNX model, and TensorRT then parses the ONNX
+model to generate the inference engine.
+
+
+Quantization aware training combines the NNI quantization algorithm 'QAT' with the NNI quantization speedup tool.
+Users should set a config to train a quantized model with the QAT algorithm (please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__).
+After quantization aware training, users get a new config with calibration parameters and a model with quantized weights. By passing the new config and the model to the quantization speedup tool, users get a real mixed precision engine for inference.
+
+
+After getting the mixed precision engine, users can run inference with input data.
+
+
+Note
+
+
+* We recommend using "cpu" (host) as the data device (for both inference data and calibration data), since data should be on the host initially and is transferred to the device before inference. If the data is not on "cpu" (host), this tool will move it to "cpu", which may add unnecessary overhead.
+* Users can also do post-training quantization by leveraging TensorRT directly (a calibration dataset must be provided).
+* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in following releases.
+
+
+Prerequisite
+------------
+CUDA version >= 11.0
+
+TensorRT version >= 7.2
+
+Note
+
+* If you haven't installed TensorRT or are using an older version, please refer to the `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__\ .
+
+Usage
+-----
+Quantization aware training:
+
+.. code-block:: python
+
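+    from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+    from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
+
+    # `model`, `optimizer`, `input_shape`, `batch_size`, `model_path`,
+    # `calibration_path` and `data` are assumed to be defined by the user
+
+    # arrange bit config for QAT algorithm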
+    configure_list = [{
+            'quant_types': ['weight', 'output'],
+            'quant_bits': {'weight':8, 'output':8},
+            'op_names': ['conv1']
+        }, {
+            'quant_types': ['output'],
+            'quant_bits': {'output':8},
+            'op_names': ['relu1']
+        }
+    ]
+
+    quantizer = QAT_Quantizer(model, configure_list, optimizer)
+    quantizer.compress()
+    # run quantization aware training here, then export the calibration config
+    calibration_config = quantizer.export_model(model_path, calibration_path)
+
+    engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=batch_size)
+    # build tensorrt inference engine
+    engine.compress()
+    # data should be pytorch tensor
+    output, time = engine.inference(data)
+
+
+Note that NNI also supports post-training quantization directly; please refer to the complete examples for details.
+
+
+For complete examples, please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.
+
+
+For more parameters of the class ``ModelSpeedupTensorRT``, refer to the `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\.
+
+
+MNIST test
+^^^^^^^^^^
+
+On one GTX2080 GPU, with input tensor ``torch.randn(128, 1, 28, 28)``:
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Quantization strategy
+     - Latency
+     - Accuracy
+   * - all in 32bit
+     - 0.001199961
+     - 96%
+   * - mixed precision(average bit 20.4)
+     - 0.000753688
+     - 96%
+   * - all in 8bit
+     - 0.000229869
+     - 93.7%
+
+
+CIFAR-10 ResNet18 test (train one epoch)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+On one GTX2080 GPU, with input tensor ``torch.randn(128, 3, 32, 32)``:
+
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Quantization strategy
+     - Latency
+     - Accuracy
+   * - all in 32bit
+     - 0.003286268
+     - 54.21%
+   * - mixed precision(average bit 11.55)
+     - 0.001358022
+     - 54.78%
+   * - all in 8bit
+     - 0.000859139
+     - 52.81%
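+
+The latency columns of the two tables above can be turned into speedup factors relative to the 32-bit baseline. The following is a small sketch of that arithmetic; the dictionary keys and the helper name ``speedup`` are illustrative only, and the numbers are copied from the tables:
+

```python
# Latencies copied from the MNIST and CIFAR-10 tables above.
latency = {
    "mnist": {"fp32": 0.001199961, "mixed": 0.000753688, "int8": 0.000229869},
    "cifar10_resnet18": {"fp32": 0.003286268, "mixed": 0.001358022, "int8": 0.000859139},
}


def speedup(results):
    """Return latency ratios relative to the all-32-bit baseline."""
    base = results["fp32"]
    return {name: base / value for name, value in results.items()}


speedups = {task: speedup(results) for task, results in latency.items()}
for task, ratios in speedups.items():
    print(task, {name: round(ratio, 2) for name, ratio in ratios.items()})
```

+
+On these measurements, the mixed precision engines are roughly 1.6x (MNIST) and 2.4x (CIFAR-10) faster than the 32-bit baseline.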
\ No newline at end of file
--- a/docs/en_US/Compression/Quantizer.rst
+++ b/docs/en_US/Compression/Quantizer.rst
+Supported Quantization Algorithms on NNI
+========================================
+
+Index of supported quantization algorithms
+
+
+* `Naive Quantizer <#naive-quantizer>`__
+* `QAT Quantizer <#qat-quantizer>`__
+* `DoReFa Quantizer <#dorefa-quantizer>`__
+* `BNN Quantizer <#bnn-quantizer>`__
+* `LSQ Quantizer <#lsq-quantizer>`__
+* `Observer Quantizer <#observer-quantizer>`__
+
+Naive Quantizer
+---------------
+
+We provide Naive Quantizer to quantize weights to 8 bits by default. You can use it to test a quantization algorithm without any configuration.
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.quantization import NaiveQuantizer
+
+   model = NaiveQuantizer(model).compress()
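+
+To make "quantize weights to 8 bits" concrete, here is a minimal plain-Python sketch of symmetric 8-bit quantization with one shared scale. It illustrates the general idea only and is not NNI's implementation; the function names are hypothetical:
+

```python
def quantize_symmetric(weights, bits=8):
    """Quantize a list of floats to signed integer codes with one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [-1.0, 0.0, 0.3, 1.0]
codes, scale = quantize_symmetric(weights)
restored = dequantize(codes, scale)
print(codes, restored)
```

+
+Each restored value differs from the original weight by at most half a quantization step (``scale / 2``).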
+
+----
+
+QAT Quantizer
+-------------
+
+In `Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__\ , authors Benoit Jacob and Skirmantas Kligys provide an algorithm that quantizes the model during training.
+
+..
+
+   We propose an approach that simulates quantization effects in the forward pass of training. Backpropagation still happens as usual, and all weights and biases are stored in floating point so that they can be easily nudged by small amounts. The forward propagation pass however simulates quantized inference as it will happen in the inference engine, by implementing in floating-point arithmetic the rounding behavior of the quantization scheme
+
+
+   * Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer, the batch normalization parameters are “folded into” the weights before quantization.
+   * Activations are quantized at points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer’s output, or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.
+
+
+Usage
+^^^^^
+
+You can quantize your model to 8 bits with the code below before your training code.
+
+PyTorch code
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+   model = Mnist()
+
+   config_list = [{
+       'quant_types': ['weight'],
+       'quant_bits': {
+           'weight': 8,
+       }, # you can also pass a single `int` here when all `quant_types` share the same bit length, see the config for `ReLU6` below.
+       'op_types':['Conv2d', 'Linear']
+   }, {
+       'quant_types': ['output'],
+       'quant_bits': 8,
+       'quant_start_step': 7000,
+       'op_types':['ReLU6']
+   }]
+   quantizer = QAT_Quantizer(model, config_list)
+   quantizer.compress()
+
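+You can view the example :githublink:`examples/model_compress/quantization/QAT_torch_quantizer.py <examples/model_compress/quantization/QAT_torch_quantizer.py>` for more information.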
+
+User configuration for QAT Quantizer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Common configuration needed by compression algorithms can be found at `Specification of config_list <./QuickStart.rst>`__.
+
+Configuration needed by this algorithm:
+
+
+* **quant_start_step:** int
+
+  Disable quantization until the model has run for a certain number of steps. This allows the network to enter a more stable
+  state where activation quantization ranges do not exclude a significant fraction of values. The default value is 0.
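+
+The effect of ``quant_start_step`` can be sketched in plain Python as follows. This is an illustration under simplifying assumptions (a single scalar output and a fixed, known range), not NNI's implementation; ``fake_quantize`` and ``forward_output`` are hypothetical names:
+

```python
def fake_quantize(x, x_min, x_max, bits=8):
    """Quantize x to an unsigned integer grid and map it back to a float."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = max(qmin, min(qmax, round(qmin - x_min / scale)))
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


def forward_output(x, step, quant_start_step, x_min, x_max):
    """Return x unchanged until quant_start_step, then the fake-quantized value."""
    if step < quant_start_step:
        return x
    return fake_quantize(x, x_min, x_max)


before = forward_output(0.1234, step=10, quant_start_step=7000, x_min=0.0, x_max=6.0)
after = forward_output(0.1234, step=7000, quant_start_step=7000, x_min=0.0, x_max=6.0)
print(before, after)
```

+
+Before the start step the activation passes through untouched; afterwards it is snapped to the nearest of the 256 representable levels in the assumed range ``[0, 6]``.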
+
+Batch normalization folding
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Batch normalization folding is supported in the QAT quantizer. It can be enabled by passing the argument ``dummy_input`` to
+the quantizer, like:
+
+.. code-block:: python
+
+    import torch
+
+    # assume your model takes an input of shape (1, 1, 28, 28)
+    # and dummy_input must be on the same device as the model
+    dummy_input = torch.randn(1, 1, 28, 28)
+
+    # pass the dummy_input to the quantizer
+    quantizer = QAT_Quantizer(model, config_list, dummy_input=dummy_input)
+
+
+The quantizer will automatically detect Conv-BN patterns and simulate the batch normalization folding process in the training
+graph. Note that when quantization aware training is finished, the folded weight/bias are restored after calling
+``quantizer.export_model``.
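+
+For reference, the arithmetic behind batch normalization folding can be sketched for a single output channel as below. This uses the standard folding formula and scalar values for readability; it is not taken from NNI's code, and ``fold_bn`` is a hypothetical helper:
+

```python
import math


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN parameters of one output channel into the preceding layer."""
    factor = gamma / math.sqrt(var + eps)
    return weight * factor, beta + (bias - mean) * factor


# One-channel example: y = BN(w * x + b)
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.05, 0.49
x = 2.0

conv_out = w * x + b
bn_out = gamma * (conv_out - mean) / math.sqrt(var + 1e-5) + beta

w_fold, b_fold = fold_bn(w, b, gamma, beta, mean, var)
folded_out = w_fold * x + b_fold
print(bn_out, folded_out)
```

+
+The folded layer produces the same output as the layer followed by batch normalization, which is why the folded weights are the ones that should be quantized.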
+
+Quantization dtype and scheme customization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Different backends on different devices use different quantization strategies (i.e. dtype (int or uint) and
+scheme (per-tensor or per-channel and symmetric or affine)). QAT quantizer supports customization of mainstream dtypes and schemes.
+There are two ways to set them. One way is to set them globally through a function named ``set_quant_scheme_dtype``, like:
+
+.. code-block:: python
+
+    from nni.compression.pytorch.quantization.settings import set_quant_scheme_dtype
+
+    # This will set all the quantization of 'input' in 'per_tensor_affine' and 'uint' manner
+    set_quant_scheme_dtype('input', 'per_tensor_affine', 'uint')
+    # This will set all the quantization of 'output' in 'per_tensor_symmetric' and 'int' manner
+    set_quant_scheme_dtype('output', 'per_tensor_symmetric', 'int')
+    # This will set all the quantization of 'weight' in 'per_channel_symmetric' and 'int' manner
+    set_quant_scheme_dtype('weight', 'per_channel_symmetric', 'int')
+
+
+The other way is more detailed. You can customize the dtype and scheme in each quantization config list like:
+
+.. code-block:: python
+
+    config_list = [{
+       'quant_types': ['weight'],
+       'quant_bits':  8,
+       'op_types':['Conv2d', 'Linear'],
+       'quant_dtype': 'int',
+       'quant_scheme': 'per_channel_symmetric'
+   }, {
+       'quant_types': ['output'],
+       'quant_bits': 8,
+       'quant_start_step': 7000,
+       'op_types':['ReLU6'],
+       'quant_dtype': 'uint',
+       'quant_scheme': 'per_tensor_affine'
+   }]
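+
+The difference between these dtypes and schemes can be illustrated with a small plain-Python sketch. It only shows how ranges, scales and zero points are typically derived and is not NNI's implementation; all function names are hypothetical:
+

```python
def int_range(bits, dtype):
    """Integer range of a signed ('int') or unsigned ('uint') dtype."""
    if dtype == "int":
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def symmetric_scale(values, bits=8):
    """Symmetric scheme: zero point fixed at 0, scale from the largest magnitude."""
    _, qmax = int_range(bits, "int")
    return max(abs(v) for v in values) / qmax


def affine_params(values, bits=8):
    """Affine scheme: scale and zero point from the observed min and max."""
    qmin, qmax = int_range(bits, "uint")
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    return scale, round(qmin - lo / scale)


# Two output channels with very different magnitudes.
weight = [[0.02, -0.01, 0.015], [1.27, -0.9, 0.4]]

per_tensor = symmetric_scale([v for channel in weight for v in channel])
per_channel = [symmetric_scale(channel) for channel in weight]
act_scale, act_zero_point = affine_params([0.0, 2.0, 5.1])
print(per_tensor, per_channel, act_scale, act_zero_point)
```

+
+With a per-tensor scale, the small-magnitude channel shares the coarse step of the large one; a per-channel scheme gives it a much finer step of its own.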
+
+Multi-GPU training
+^^^^^^^^^^^^^^^^^^^
+QAT quantizer natively supports multi-GPU training (DataParallel and DistributedDataParallel). Note that the quantizer
+must be instantiated before you wrap your model with DataParallel or DistributedDataParallel. For example:
+
+.. code-block:: python
+
+    from torch.nn.parallel import DistributedDataParallel as DDP
+    from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+
+    model = define_your_model()
+
+    quantizer = QAT_Quantizer(model, **other_params)  # <--- QAT_Quantizer instantiation
+    model = quantizer.compress()
+
+    model = DDP(model)
+
+    for i in range(epochs):
+        train(model)
+        evaluate(model)
+
+
+----
+
+LSQ Quantizer
+-------------
+
+In `LEARNED STEP SIZE QUANTIZATION <https://arxiv.org/pdf/1902.08153.pdf>`__\ , authors Steven K. Esser and Jeffrey L. McKinstry provide an algorithm to train the scales with gradients.
+
+..
+
+   The authors introduce a novel means to estimate and scale the task loss gradient at each weight and activation layer’s quantizer step size, such that it can be learned in conjunction with other network parameters.
+
+
+Usage
+^^^^^
+You can add the code below before your training code. Three things must be done:
+
+
+1. Configure which layers to quantize and which tensors (input/output/weight) of those layers to quantize.
+2. Construct the LSQ quantizer.
+3. Call the ``compress`` API.
+
+
+PyTorch code
+
+.. code-block:: python
+
+    from nni.algorithms.compression.pytorch.quantization import LsqQuantizer
+    model = Mnist()
+
+    configure_list = [{
+            'quant_types': ['weight', 'input'],
+            'quant_bits': {
+                'weight': 8,
+                'input': 8,
+            },
+            'op_names': ['conv1']
+        }, {
+            'quant_types': ['output'],
+            'quant_bits': {'output': 8,},
+            'op_names': ['relu1']
+    }]
+
+    # `optimizer` is assumed to be the optimizer used for training
+    quantizer = LsqQuantizer(model, configure_list, optimizer)
+    quantizer.compress()
+
+You can view the example :githublink:`examples/model_compress/quantization/LSQ_torch_quantizer.py <examples/model_compress/quantization/LSQ_torch_quantizer.py>` for more information.
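+
+The learned step size idea can be sketched for a single scalar in plain Python. The forward rule, the step-size gradient and the gradient scale below follow the formulas given in the LSQ paper; the function names are hypothetical and this is not NNI's ``LsqQuantizer`` code:
+

```python
import math


def lsq_forward(v, step, bits=8):
    """Quantize v with step size `step` over a signed integer range."""
    q_neg, q_pos = 2 ** (bits - 1), 2 ** (bits - 1) - 1
    clipped = max(-q_neg, min(q_pos, v / step))
    return round(clipped) * step


def lsq_step_grad(v, step, bits=8):
    """Gradient of the quantized value with respect to the step size."""
    q_neg, q_pos = 2 ** (bits - 1), 2 ** (bits - 1) - 1
    ratio = v / step
    if ratio <= -q_neg:
        return -q_neg
    if ratio >= q_pos:
        return q_pos
    return round(ratio) - ratio      # straight-through estimate for round()


def lsq_grad_scale(num_elements, bits=8):
    """Scale applied to the step-size gradient: 1 / sqrt(N * Qp)."""
    q_pos = 2 ** (bits - 1) - 1
    return 1.0 / math.sqrt(num_elements * q_pos)


print(lsq_forward(0.26, 0.1), lsq_step_grad(0.26, 0.1), lsq_grad_scale(127))
```

+
+Because the gradient with respect to the step size is well defined both inside and outside the clipping range, the step size can be updated by the optimizer together with the other parameters.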
+
+User configuration for LSQ Quantizer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Common configuration needed by compression algorithms can be found at `Specification of config_list <./QuickStart.rst>`__.
+
+configuration needed by this algorithm :
+
+
+----
+
+DoReFa Quantizer
+----------------
+
+In `DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients <https://arxiv.org/abs/1606.06160>`__\ , authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weights, activations and gradients during training.
+
+Usage
+^^^^^
+
+To use DoReFa Quantizer, add the code below before your training code:
+
+PyTorch code
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
+   config_list = [{ 
+       'quant_types': ['weight'],
+       'quant_bits': 8, 
+       'op_types': ['default'] 
+   }]
+   quantizer = DoReFaQuantizer(model, config_list)
+   quantizer.compress()
+
+You can view example for more information
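+
+As a rough illustration of what DoReFa does to weights, the sketch below applies the k-bit weight quantization formula from the DoReFa-Net paper to a plain list of floats. It is a simplified sketch for ``k > 1``, not NNI's implementation, and the function names are hypothetical:
+

```python
import math


def quantize_k(r, k):
    """Round r in [0, 1] to one of 2**k evenly spaced levels."""
    levels = 2 ** k - 1
    return round(r * levels) / levels


def dorefa_weights(weights, k):
    """Quantize weights to k bits; the results lie in [-1, 1]."""
    squashed = [math.tanh(w) for w in weights]
    max_abs = max(abs(t) for t in squashed)
    return [2 * quantize_k(t / (2 * max_abs) + 0.5, k) - 1 for t in squashed]


quantized = dorefa_weights([-0.5, -0.1, 0.2, 0.5], k=2)
print(quantized)
```

+
+With ``k=2`` every weight ends up on one of four levels between -1 and 1.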
+
+User configuration for DoReFa Quantizer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+common configuration needed by compression algorithms can be found at `Specification of ``config_list`` <./QuickStart.rst>`__.
+
+configuration needed by this algorithm :
+
+----
+
+BNN Quantizer
+-------------
+
+In `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ , the authors describe the method as follows:
+
+..
+
+   We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
+
+
+Usage
+^^^^^
+
+PyTorch code
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.quantization import BNNQuantizer
+   model = VGG_Cifar10(num_classes=10)
+
+   configure_list = [{
+       'quant_bits': 1,
+       'quant_types': ['weight'],
+       'op_types': ['Conv2d', 'Linear'],
+       'op_names': ['features.0', 'features.3', 'features.7', 'features.10', 'features.14', 'features.17', 'classifier.0', 'classifier.3']
+   }, {
+       'quant_bits': 1,
+       'quant_types': ['output'],
+       'op_types': ['Hardtanh'],
+       'op_names': ['features.6', 'features.9', 'features.13', 'features.16', 'features.20', 'classifier.2', 'classifier.5']
+   }]
+
+   quantizer = BNNQuantizer(model, configure_list)
+   model = quantizer.compress()
+
+You can view example :githublink:`examples/model_compress/quantization/BNN_quantizer_cifar10.py <examples/model_compress/quantization/BNN_quantizer_cifar10.py>` for more information.
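+
+The core of binarization can be sketched in a few lines of plain Python: a deterministic sign function in the forward pass and a straight-through estimator in the backward pass. This is an illustration of the technique described in the paper, not NNI's ``BNNQuantizer`` code; the function names are hypothetical:
+

```python
def binarize(values):
    """Deterministic binarization: +1 for non-negative inputs, -1 otherwise."""
    return [1.0 if v >= 0 else -1.0 for v in values]


def ste_grad(values, upstream):
    """Straight-through estimator: pass the gradient where |v| <= 1, else block it."""
    return [g if abs(v) <= 1 else 0.0 for v, g in zip(values, upstream)]


weights = [-0.3, 0.0, 0.7, 2.0]
binary = binarize(weights)
grads = ste_grad(weights, upstream=[0.5, 0.5, 0.5, 0.5])
print(binary, grads)
```

+
+The gradient cut-off at magnitude 1 has the same shape as the derivative of ``Hardtanh``, the activation used in the configuration above.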
+
+User configuration for BNN Quantizer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+common configuration needed by compression algorithms can be found at `Specification of ``config_list`` <./QuickStart.rst>`__.
+
+configuration needed by this algorithm :
+
+Experiment
+^^^^^^^^^^
+
+We implemented one of the experiments in `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ : we quantized the **VGGNet** for CIFAR-10 described in the paper. Our experiment results are as follows:
+
+.. list-table::
+   :header-rows: 1
+   :widths: auto
+
+   * - Model
+     - Accuracy
+   * - VGGNet
+     - 86.93%
+
+
+The experiment code can be found at :githublink:`examples/model_compress/quantization/BNN_quantizer_cifar10.py <examples/model_compress/quantization/BNN_quantizer_cifar10.py>`.
+
+
+Observer Quantizer
+------------------
+
+..
+
+   Observer quantizer is a framework of post-training quantization. It will insert observers into the place where the quantization will happen. During quantization calibration, each observer will record all the tensors it 'sees'. These tensors will be used to calculate the quantization statistics after calibration.
+
+Usage
+^^^^^
+
+1. Configure which layers to quantize and which tensors (input/output/weight) of those layers to quantize.
+2. Construct the observer quantizer.
+3. Run quantization calibration.
+4. Call the ``compress`` API to calculate the scale and zero point for each tensor and switch the model to evaluation mode.
+
+PyTorch code
+
+.. code-block:: python
+
+   import torch
+   from nni.algorithms.compression.pytorch.quantization import ObserverQuantizer
+
+   def calibration(model, calib_loader):
+       model.eval()
+       with torch.no_grad():
+           for data, _ in calib_loader:
+               model(data)
+
+   model = Mnist()
+
+   configure_list = [{
+       'quant_bits': 8,
+       'quant_types': ['weight', 'input'],
+       'op_names': ['conv1', 'conv2'],
+   }, {
+       'quant_bits': 8,
+       'quant_types': ['output'],
+       'op_names': ['relu1', 'relu2'],
+   }]
+
+   quantizer = ObserverQuantizer(model, configure_list)
+   calibration(model, calib_loader)
+   model = quantizer.compress()
+
+You can view example :githublink:`examples/model_compress/quantization/observer_quantizer.py <examples/model_compress/quantization/observer_quantizer.py>` for more information.
+
+User configuration for Observer Quantizer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Common configuration needed by compression algorithms can be found at `Specification of config_list <./QuickStart.rst>`__.
+
+
+.. note::
+    This quantizer is still under development. Some quantizer settings are hard-coded:
+
+    - weight observer: per_tensor_symmetric, qint8
+    - output observer: per_tensor_affine, quint8, reduce_range=True
+
+    Other settings (such as quant_type and op_names) can be configured.
+
+About the compress API
+^^^^^^^^^^^^^^^^^^^^^^
+Before the ``compress`` API is called, the model only records tensors' statistics and no quantization is executed.
+After the ``compress`` API is called, the model no longer records tensors' statistics. The quantization scale and zero point are
+generated for each tensor and are used to quantize each tensor during inference (we call this evaluation mode).
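+
+The two phases can be sketched with a tiny min/max observer in plain Python. The class below is a hypothetical illustration, not NNI's observer; the default ``qmax=127`` mirrors an unsigned 8-bit range with a reduced range, as in the hard-coded output observer setting listed above:
+

```python
class MinMaxObserver:
    """Records the running min/max of observed values until frozen."""

    def __init__(self, qmin=0, qmax=127):
        self.qmin, self.qmax = qmin, qmax
        self.lo, self.hi = 0.0, 0.0          # the range always contains zero
        self.frozen = False

    def observe(self, values):
        if self.frozen:                      # after freezing, stop recording
            return
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))

    def freeze(self):
        """Compute scale and zero point from the recorded statistics."""
        self.frozen = True
        scale = (self.hi - self.lo) / (self.qmax - self.qmin)
        zero_point = round(self.qmin - self.lo / scale)
        return scale, zero_point


observer = MinMaxObserver()
for batch in ([0.2, 1.0], [-1.0, 1.54]):     # calibration phase
    observer.observe(batch)
scale, zero_point = observer.freeze()        # plays the role of compress()
observer.observe([100.0])                    # ignored: statistics are frozen
print(scale, zero_point)
```

+
+Here ``freeze`` stands in for the switch to evaluation mode: once it has been called, new tensors no longer change the scale or the zero point.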
+
+About calibration
+^^^^^^^^^^^^^^^^^
+Usually we pick about 100 training/evaluation examples for calibration. If you find the accuracy is a bit low, try
+reducing the number of calibration examples.
+
--- a/docs/en_US/Compression/QuickStart.rst
+++ b/docs/en_US/Compression/QuickStart.rst
+Quick Start
+===========
+
+..  toctree::
+    :hidden:
+
+    Notebook Example <compression_pipeline_example>
+
+
+Model compression usually consists of three stages: 1) pre-training a model, 2) compressing the model, 3) fine-tuning the model. NNI mainly focuses on the second stage and provides simple APIs for compressing a model. Follow this guide for a quick look at how easy it is to use NNI to compress a model.
+
+A `compression pipeline example <./compression_pipeline_example.rst>`__ is available as a Jupyter notebook; refer to the code :githublink:`here <examples/notebooks/compression_pipeline_example.ipynb>`.
+
+Model Pruning
+-------------
+
+Here we use `level pruner <../Compression/Pruner.rst#level-pruner>`__ as an example to show the usage of pruning in NNI.
+
+Step1. Write configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the ``default`` ops to sparsity 0.5 while keeping other layers unpruned.
+
+.. code-block:: python
+
+   config_list = [{
+       'sparsity': 0.5,
+       'op_types': ['default'],
+   }]
+
+The specification of configuration can be found `here <./Tutorial.rst#specify-the-configuration>`__. Note that different pruners may have their own defined fields in configuration. Please refer to each pruner's `usage <./Pruner.rst>`__ for details, and adjust the configuration accordingly.
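+
+To illustrate what "sparsity 0.5" means for a level pruner, the following plain-Python sketch builds a mask that zeroes the half of the weights with the smallest magnitude. It is a simplified illustration on a flat list, not NNI's ``LevelPruner`` code; ``level_mask`` is a hypothetical helper:
+

```python
def level_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude weights."""
    k = int(len(weights) * sparsity)          # number of weights to prune
    if k == 0:
        return [1] * len(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [1 if abs(w) > threshold else 0 for w in weights]


weights = [0.1, -0.9, 0.4, -0.2]
mask = level_mask(weights, sparsity=0.5)
pruned = [w * m for w, m in zip(weights, mask)]
print(mask, pruned)
```

+
+The mask, rather than the weights themselves, is what a pruner typically keeps track of, so that pruned positions stay at zero during later training.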
+
+Step2. Choose a pruner and compress the model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First instantiate the chosen pruner with your model and configuration as arguments, then invoke ``compress()`` to compress your model. Note that some algorithms may check gradients for compressing, so you may also need to define a trainer, an optimizer and a criterion and pass them to the pruner.
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.pruning import LevelPruner
+
+   pruner = LevelPruner(model, config_list)
+   model = pruner.compress()
+
+Some pruners (e.g., L1FilterPruner, FPGMPruner) prune once, while others (e.g., AGPPruner) prune your model iteratively, adjusting the masks epoch by epoch during training.
+
+If a pruner prunes your model iteratively, or needs training or inference to get gradients, you need to pass the fine-tuning logic to the pruner.
+
+For example:
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.pruning import AGPPruner
+
+   pruner = AGPPruner(model, config_list, optimizer, trainer, criterion, num_iterations=10, epochs_per_iteration=1, pruning_algorithm='level')
+   model = pruner.compress()
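+
+Iterative pruning raises the target sparsity step by step instead of applying it at once. The sketch below shows the cubic schedule usually associated with automated gradual pruning (AGP); it is an assumption-laden illustration in which ``total_steps`` plays the role of ``num_iterations`` above, and NNI's ``AGPPruner`` may differ in its details:
+

```python
def agp_sparsity(step, total_steps, initial_sparsity, final_sparsity):
    """Cubic schedule that ramps sparsity from initial to final over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


schedule = [agp_sparsity(t, 10, 0.0, 0.5) for t in range(11)]
print([round(s, 3) for s in schedule])
```

+
+The schedule prunes aggressively at the beginning, when many weights are redundant, and slows down as it approaches the final sparsity.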
+
+Step3. Export compression result
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After training, you can export the model weights to a file, and the generated masks to a file as well. Exporting an ONNX model is also supported.
+
+.. code-block:: python
+
+   pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')
+
+Please refer to :githublink:`mnist example <examples/model_compress/pruning/naive_prune_torch.py>` for example code.
+
+More examples of pruning algorithms can be found in :githublink:`basic_pruners_torch <examples/model_compress/pruning/basic_pruners_torch.py>` and :githublink:`auto_pruners_torch <examples/model_compress/pruning/auto_pruners_torch.py>`.
+
+
+Model Quantization
+------------------
+
+Here we use `QAT Quantizer <../Compression/Quantizer.rst#qat-quantizer>`__ as an example to show the usage of quantization in NNI.
+
+Step1. Write configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+   config_list = [{
+       'quant_types': ['weight', 'input'],
+       'quant_bits': {
+           'weight': 8,
+           'input': 8,
+       }, # you can also pass a single `int` here when all `quant_types` share the same bit length, see the config for `ReLU6` below.
+       'op_types':['Conv2d', 'Linear'],
+       'quant_dtype': 'int',
+       'quant_scheme': 'per_channel_symmetric'
+   }, {
+       'quant_types': ['output'],
+       'quant_bits': 8,
+       'quant_start_step': 7000,
+       'op_types':['ReLU6'],
+       'quant_dtype': 'uint',
+       'quant_scheme': 'per_tensor_affine'
+   }]
+
+The specification of configuration can be found `here <./Tutorial.rst#quantization-specific-keys>`__.
+
+Step2. Choose a quantizer and compress the model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
+
+   quantizer = QAT_Quantizer(model, config_list)
+   quantizer.compress()
+
+
+Step3. Export compression result
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After training and calibration, you can export the model weights to a file, and the generated calibration parameters to a file as well. Exporting an ONNX model is also supported.
+
+.. code-block:: python
+
+   calibration_config = quantizer.export_model(model_path, calibration_path, onnx_path, input_shape, device)
+
+Please refer to :githublink:`mnist example <examples/model_compress/quantization/QAT_torch_quantizer.py>` for example code.
+
+Congratulations! You've compressed your first model via NNI. To go a bit more in depth about model compression in NNI, check out the `Tutorial <./Tutorial.rst>`__.
\ No newline at end of file
--- a/docs/en_US/Compression/Tutorial.rst
+++ b/docs/en_US/Compression/Tutorial.rst
--- a/docs/en_US/Compression/advanced.rst
+++ b/docs/en_US/Compression/advanced.rst
+Advanced Usage
+==============
+
+..  toctree::
+    :maxdepth: 2
+
+    Framework <./Framework>
+    Customize a new algorithm <./CustomizeCompressor>
+    Automatic Model Compression (Beta) <./AutoCompression>
--- a/docs/en_US/Compression/compression_pipeline_example.ipynb
+++ b/docs/en_US/Compression/compression_pipeline_example.ipynb
--- a/docs/en_US/Compression/pruning.rst
+++ b/docs/en_US/Compression/pruning.rst
+#################
+Pruning
+#################
+
+Pruning is a common technique to compress neural network models.
+Pruning methods explore the redundancy in the model weights (parameters) and try to remove (prune) the redundant and uncritical weights.
+The redundant elements are pruned from the model: their values are zeroed and we make sure they don't take part in the back-propagation process.
+
+In terms of pruning granularity, fine-grained (unstructured) pruning prunes individual weights separately.
+Coarse-grained (structured) pruning prunes an entire group of weights, such as a convolutional filter.
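+
+As a minimal illustration of coarse-grained pruning, the sketch below removes whole filters according to their L1 norm. Filters are represented as plain lists of floats, and ``prune_filters_l1`` is a hypothetical helper rather than one of NNI's pruners:
+

```python
def prune_filters_l1(filters, sparsity):
    """Zero whole filters with the smallest L1 norms (structured pruning)."""
    norms = [sum(abs(w) for w in f) for f in filters]
    k = int(len(filters) * sparsity)                    # number of filters to remove
    drop = set(sorted(range(len(filters)), key=lambda i: norms[i])[:k])
    return [[0.0] * len(f) if i in drop else list(f) for i, f in enumerate(filters)]


filters = [[0.5, -0.4], [0.01, 0.02], [-0.3, 0.9], [0.05, -0.05]]
structured = prune_filters_l1(filters, sparsity=0.5)
print(structured)
```

+
+Because entire filters become zero, they can later be removed from the layer altogether, which is what makes structured pruning attractive for real speedup.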
+
+NNI provides multiple unstructured and structured pruning algorithms.
+It supports TensorFlow and PyTorch with a unified interface.
+To prune their models, users only need to add several lines to their code.
+For structured filter pruning, NNI also provides a dependency-aware mode. In dependency-aware mode, the
+filter pruner gets a better speed gain after the speedup.
+
+For details, please refer to the following tutorials:
+
+..  toctree::
+    :maxdepth: 2
+
+    Pruners <Pruner>
+    Dependency Aware Mode <DependencyAware>
+    Model Speedup <ModelSpeedup>
--- a/docs/en_US/Compression/quantization.rst
+++ b/docs/en_US/Compression/quantization.rst
+#################
+Quantization
+#################
+
+Quantization refers to compressing models by reducing the number of bits required to represent weights or activations,
+which can reduce computation and inference time. In the context of deep neural networks, the major numerical
+format for model weights is 32-bit float, or FP32. Many research works have demonstrated that weights and activations
+can be represented using 8-bit integers without significant loss in accuracy. Even lower bit-widths, such as 4/2/1 bits,
+are an active field of research.
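+
+The storage effect of lower bit-widths is simple arithmetic, sketched below for a hypothetical model with one million weights. The estimate ignores the small overhead of storing scales and zero points:
+

```python
def weight_bytes(num_params, bits):
    """Storage needed for num_params weights at the given bit-width."""
    return num_params * bits / 8


num_params = 1_000_000   # hypothetical model size
for bits in (32, 8, 4, 2, 1):
    ratio = weight_bytes(num_params, 32) / weight_bytes(num_params, bits)
    print(f"{bits:>2}-bit: {weight_bytes(num_params, bits):>9.0f} bytes ({ratio:.0f}x vs FP32)")
```

+
+Going from FP32 to 8-bit integers cuts weight storage by a factor of four, and 1-bit weights by a factor of 32.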
+
+A quantizer is an implementation of a quantization algorithm in NNI. NNI provides multiple quantizers, listed below. You can also
+create your own quantizer using the NNI model compression interface.
+
+..  toctree::
+    :maxdepth: 2
+
+    Quantizers <Quantizer>
+    Quantization Speedup <QuantizationSpeedup>