"docs/source/vscode:/vscode.git/clone" did not exist on "632df3ea2f99f3c8e4d2a16fab6ebe4303609da1"
Unverified commit 1ae8a0db, authored by chicm-ms, committed by GitHub

Refactoring model compression doc (#2919)

parent 8d3f444a
@@ -23,7 +23,7 @@ The experiments are performed with the following pruners/datasets/models:
For the pruners with scheduling, `L1Filter Pruner` is used as the base algorithm. That is to say, after the sparsity distribution is decided by the scheduling algorithm, `L1Filter Pruner` is used to perform the real pruning.
- All the pruners listed above are implemented in [nni](https://github.com/microsoft/nni/tree/master/docs/en_US/Compressor/Overview.md).
- All the pruners listed above are implemented in [nni](https://github.com/microsoft/nni/tree/master/docs/en_US/Compression/Overview.md).
## Experiment Result
@@ -60,7 +60,7 @@ From the experiment result, we get the following conclusions:
* The experiment results are all collected with the default configuration of the pruners in nni, which means that when we call a pruner class in nni, we don't change any default class arguments.
* Both FLOPs and the number of parameters are counted with [Model FLOPs/Parameters Counter](https://github.com/microsoft/nni/tree/master/docs/en_US/Compressor/CompressionUtils.md#model-flopsparameters-counter) after [model speed up](https://github.com/microsoft/nni/tree/master/docs/en_US/Compressor/ModelSpeedup.md).
* Both FLOPs and the number of parameters are counted with [Model FLOPs/Parameters Counter](https://github.com/microsoft/nni/tree/master/docs/en_US/Compression/CompressionUtils.md#model-flopsparameters-counter) after [model speed up](https://github.com/microsoft/nni/tree/master/docs/en_US/Compression/ModelSpeedup.md).
This avoids the potential issue of counting them on masked models.
* The experiment code can be found [here](https://github.com/microsoft/nni/tree/master/examples/model_compress/auto_pruners_torch.py).
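For reference, counting with the utility linked above looks roughly like this. This is a hedged sketch: `model` is assumed to be the already sped-up PyTorch model, and the import path follows NNI v1.x and may differ in other versions.

```python
# Hedged sketch: count FLOPs/parameters after model speedup.
from nni.compression.torch.utils.counter import count_flops_params

# (1, 3, 32, 32) is a CIFAR-sized dummy input shape.
flops, params = count_flops_params(model, (1, 3, 32, 32))
print(f'FLOPs: {flops}, #Params: {params}')
```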
......
# Automatic Model Compression on NNI
# Automatic Model Pruning using NNI Tuners
It's convenient to implement auto model compression with NNI compression and NNI tuners.
It's convenient to implement auto model pruning with NNI compression and NNI tuners.
## First, model compression with NNI
......
@@ -6,7 +6,7 @@
In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface that covers both pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then how to customize a new quantization algorithm.
**Important Note** To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. Reference [Framework overview of model compression](https://nni.readthedocs.io/en/latest/Compressor/Framework.html)
**Important Note** To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. Reference [Framework overview of model compression](https://nni.readthedocs.io/en/latest/Compression/Framework.html)
## Customize a new pruning algorithm
......
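Although the full walkthrough is elided above, a custom pruner essentially boils down to subclassing `Pruner` and implementing `calc_mask`. Below is a hedged sketch; the class and its threshold logic are illustrative, and the import path and mask-dict keys follow NNI v1.x and may differ in other versions.

```python
# Hedged sketch of a custom magnitude-based pruner (illustrative only).
import torch
from nni.compression.torch import Pruner

class MyMagnitudePruner(Pruner):
    def calc_mask(self, wrapper, **kwargs):
        # Keep weights whose magnitude is above the configured sparsity quantile.
        weight = wrapper.module.weight.data
        sparsity = wrapper.config['sparsity']
        threshold = torch.quantile(weight.abs().flatten(), sparsity)
        return {'weight_mask': (weight.abs() > threshold).type_as(weight)}
```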
@@ -27,20 +27,20 @@ Pruning algorithms compress the original network by removing redundant weights o
|Name|Brief Introduction of Algorithm|
|---|---|
| [Level Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#level-pruner) | Pruning a specified ratio of weights based on their absolute values |
| [AGP Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#agp-pruner) | Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) [Reference Paper](https://arxiv.org/abs/1710.01878)|
| [Lottery Ticket Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#lottery-ticket-hypothesis) | The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. [Reference Paper](https://arxiv.org/abs/1803.03635)|
| [FPGM Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)|
| [L1Filter Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#l1filter-pruner) | Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) [Reference Paper](https://arxiv.org/abs/1608.08710) |
| [L2Filter Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#l2filter-pruner) | Pruning filters with the smallest L2 norm of weights in convolution layers |
| [ActivationAPoZRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#activationapozrankfilterpruner) | Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. [Reference Paper](https://arxiv.org/abs/1607.03250) |
| [ActivationMeanRankFilterPruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#activationmeanrankfilterpruner) | Pruning filters with the smallest mean value of output activations |
| [Slim Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) |
| [TaylorFO Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#taylorfoweightfilterpruner) | Pruning filters based on the first-order Taylor expansion of weights (Importance Estimation for Neural Network Pruning) [Reference Paper](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf) |
| [ADMM Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#admm-pruner) | Pruning based on ADMM optimization technique [Reference Paper](https://arxiv.org/abs/1804.03294) |
| [NetAdapt Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#netadapt-pruner) | Automatically simplify a pretrained network to meet the resource budget by iterative pruning [Reference Paper](https://arxiv.org/abs/1804.03230) |
| [SimulatedAnnealing Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#simulatedannealing-pruner) | Automatic pruning with a guided heuristic search method, Simulated Annealing algorithm [Reference Paper](https://arxiv.org/abs/1907.03141) |
| [AutoCompress Pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#autocompress-pruner) | Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner [Reference Paper](https://arxiv.org/abs/1907.03141) |
| [Level Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#level-pruner) | Pruning a specified ratio of weights based on their absolute values |
| [AGP Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#agp-pruner) | Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) [Reference Paper](https://arxiv.org/abs/1710.01878)|
| [Lottery Ticket Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#lottery-ticket-hypothesis) | The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. [Reference Paper](https://arxiv.org/abs/1803.03635)|
| [FPGM Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)|
| [L1Filter Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#l1filter-pruner) | Pruning filters with the smallest L1 norm of weights in convolution layers (Pruning Filters for Efficient Convnets) [Reference Paper](https://arxiv.org/abs/1608.08710) |
| [L2Filter Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#l2filter-pruner) | Pruning filters with the smallest L2 norm of weights in convolution layers |
| [ActivationAPoZRankFilterPruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#activationapozrankfilterpruner) | Pruning filters based on the metric APoZ (average percentage of zeros) which measures the percentage of zeros in activations of (convolutional) layers. [Reference Paper](https://arxiv.org/abs/1607.03250) |
| [ActivationMeanRankFilterPruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#activationmeanrankfilterpruner) | Pruning filters with the smallest mean value of output activations |
| [Slim Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) |
| [TaylorFO Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#taylorfoweightfilterpruner) | Pruning filters based on the first-order Taylor expansion of weights (Importance Estimation for Neural Network Pruning) [Reference Paper](http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf) |
| [ADMM Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#admm-pruner) | Pruning based on ADMM optimization technique [Reference Paper](https://arxiv.org/abs/1804.03294) |
| [NetAdapt Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#netadapt-pruner) | Automatically simplify a pretrained network to meet the resource budget by iterative pruning [Reference Paper](https://arxiv.org/abs/1804.03230) |
| [SimulatedAnnealing Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#simulatedannealing-pruner) | Automatic pruning with a guided heuristic search method, Simulated Annealing algorithm [Reference Paper](https://arxiv.org/abs/1907.03141) |
| [AutoCompress Pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#autocompress-pruner) | Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner [Reference Paper](https://arxiv.org/abs/1907.03141) |
You can refer to this [benchmark](https://github.com/microsoft/nni/tree/master/docs/en_US/CommunitySharings/ModelCompressionComparison.md) for the performance of these pruners on some benchmark problems.
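As a usage sketch, invoking one of the one-shot pruners from the table typically takes a few lines. This is a hedged example with `L1FilterPruner`; the import path follows NNI v1.x and may differ in other versions.

```python
# Hedged sketch: one-shot filter pruning with L1FilterPruner.
import torchvision.models as models
from nni.compression.torch import L1FilterPruner

model = models.vgg16()
# Prune 50% of the filters in every Conv2d layer.
config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
pruner = L1FilterPruner(model, config_list)
model = pruner.compress()  # wraps layers with pruning masks
```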
@@ -50,14 +50,14 @@ Quantization algorithms compress the original network by reducing the number of
|Name|Brief Introduction of Algorithm|
|---|---|
| [Naive Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#naive-quantizer) | Quantize weights to default 8 bits |
| [QAT Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)|
| [DoReFa Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)|
| [BNN Quantizer](https://nni.readthedocs.io/en/latest/Compressor/Quantizer.html#bnn-quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. [Reference Paper](https://arxiv.org/abs/1602.02830)|
| [Naive Quantizer](https://nni.readthedocs.io/en/latest/Compression/Quantizer.html#naive-quantizer) | Quantize weights to default 8 bits |
| [QAT Quantizer](https://nni.readthedocs.io/en/latest/Compression/Quantizer.html#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)|
| [DoReFa Quantizer](https://nni.readthedocs.io/en/latest/Compression/Quantizer.html#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)|
| [BNN Quantizer](https://nni.readthedocs.io/en/latest/Compression/Quantizer.html#bnn-quantizer) | Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. [Reference Paper](https://arxiv.org/abs/1602.02830)|
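Usage mirrors the pruners. Below is a hedged sketch with `QAT_Quantizer`; the import path and config keys follow NNI v1.x and may differ in other versions.

```python
# Hedged sketch: quantization-aware training setup with QAT_Quantizer.
import torchvision.models as models
from nni.compression.torch import QAT_Quantizer

model = models.resnet18()
config_list = [{
    'quant_types': ['weight'],
    'quant_bits': {'weight': 8},   # quantize weights to 8 bits
    'op_types': ['Conv2d', 'Linear']
}]
quantizer = QAT_Quantizer(model, config_list)
quantizer.compress()
```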
## Automatic Model Compression
Given a target compression ratio, it is hard to obtain the best compressed model in one shot. An automatic model compression algorithm usually needs to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying the sparsity of each layer in a model. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. Detailed documentation can be found [here](./AutoCompression.md).
Given a target compression ratio, it is hard to obtain the best compressed model in one shot. An automatic model compression algorithm usually needs to explore the compression space by compressing different layers with different sparsities. NNI provides such algorithms to free users from specifying the sparsity of each layer in a model. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. Detailed documentation can be found [here](./AutoPruningUsingTuners.md).
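A hedged sketch of what tuner-driven pruning can look like inside NNI trial code; `evaluate` is a hypothetical fine-tune-and-evaluate helper, and the experiment's search space is assumed to define `sparsity`.

```python
# Hedged sketch: let an NNI tuner propose the sparsity for each trial.
import nni
import torchvision.models as models
from nni.compression.torch import L1FilterPruner

params = nni.get_next_parameter()            # e.g. {'sparsity': 0.4}
model = models.resnet18()
config_list = [{'sparsity': params['sparsity'], 'op_types': ['Conv2d']}]
pruner = L1FilterPruner(model, config_list)
model = pruner.compress()

accuracy = evaluate(model)                   # hypothetical fine-tune + evaluation helper
nni.report_final_result(accuracy)            # feed the result back to the tuner
```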
## Model Speedup
......
@@ -8,7 +8,7 @@ In this tutorial, we use the [first section](#quick-start-to-compress-a-model) t
## Quick Start to Compress a Model
NNI provides very simple APIs for compressing a model. Compression includes pruning and quantization algorithms, and their usage is the same; here we use [slim pruner](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#slim-pruner) as an example to show the usage.
NNI provides very simple APIs for compressing a model. Compression includes pruning and quantization algorithms, and their usage is the same; here we use [slim pruner](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#slim-pruner) as an example to show the usage.
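A hedged sketch of the basic usage, which the sections below break into steps; the import path follows NNI v1.x and may differ in other versions.

```python
# Hedged sketch: compress a model with SlimPruner.
import torchvision.models as models
from nni.compression.torch import SlimPruner

model = models.vgg16_bn()  # SlimPruner acts on BatchNorm scaling factors
config_list = [{'sparsity': 0.7, 'op_types': ['BatchNorm2d']}]
pruner = SlimPruner(model, config_list)
model = pruner.compress()
```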
### Write configuration
@@ -175,7 +175,7 @@ In this example, 'op_names' is the name of layer and four layers will be quantiz
### APIs for Updating Fine Tuning Status
Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](https://nni.readthedocs.io/en/latest/Compressor/Pruner.html#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide two additional APIs for users to invoke: `pruner.update_epoch(epoch)` and `pruner.step()`.
Some compression algorithms use epochs to control the progress of compression (e.g. [AGP](https://nni.readthedocs.io/en/latest/Compression/Pruner.html#agp-pruner)), and some algorithms need to do something after every minibatch. Therefore, we provide two additional APIs for users to invoke: `pruner.update_epoch(epoch)` and `pruner.step()`.
`update_epoch` should be invoked in every epoch, while `step` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's document for details. For the algorithms that do not need them, calling them is allowed but has no effect.
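A hedged sketch of where the two calls sit in a typical training loop; `num_epochs`, `train_loader`, `model`, `criterion`, and `optimizer` are assumed to be defined by the surrounding trial code.

```python
# Hedged sketch: placement of update_epoch() and step() during training.
for epoch in range(num_epochs):
    pruner.update_epoch(epoch)       # lets epoch-driven algorithms (e.g. AGP) advance
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        pruner.step()                # per-minibatch hook for algorithms that need it
```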
......
#################
Pruning
#################
Pruning is a common technique to compress neural network models.
Pruning methods explore the redundancy in the model weights (parameters) and try to remove/prune the redundant and uncritical weights.
The redundant elements are pruned from the model: their values are zeroed out, and we make sure they don't take part in the back-propagation process.
From the perspective of pruning granularity, fine-grained pruning (or unstructured pruning) refers to pruning each individual weight separately, while
coarse-grained pruning (or structured pruning) prunes entire groups of weights, such as a convolutional filter.
NNI provides multiple unstructured and structured pruning algorithms.
It supports TensorFlow and PyTorch with a unified interface.
To prune their models, users only need to add several lines to their code.
For structured filter pruning, NNI also provides a dependency-aware mode, in which the
filter pruner achieves a better speed gain after the speedup.
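A hedged sketch of enabling the dependency-aware mode (``model`` and ``config_list`` are assumed to be defined as in the quick start; the argument names follow NNI v1.x and may differ in other versions):

.. code-block:: python

    # Hedged sketch: dependency-aware filter pruning.
    import torch
    from nni.compression.torch import L1FilterPruner

    # A dummy input is needed to trace channel dependencies in the graph.
    pruner = L1FilterPruner(model, config_list,
                            dependency_aware=True,
                            dummy_input=torch.rand(1, 3, 224, 224))
    model = pruner.compress()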
For details, please refer to the following tutorials:
.. toctree::
    :maxdepth: 2

    Pruners <Pruner>
    Dependency Aware Mode <DependencyAware>
    Model Speedup <ModelSpeedup>
    Automatic Model Pruning with NNI Tuners <AutoPruningUsingTuners>
#################
Quantization
#################
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations,
which can reduce both computation and inference time. In the context of deep neural networks, the dominant numerical
format for model weights is 32-bit float, or FP32. Many research works have demonstrated that weights and activations
can be represented using 8-bit integers without significant loss in accuracy. Even lower bit-widths, such as 4/2/1 bits,
are an active field of research.
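For intuition, a common uniform (affine) quantization scheme maps a float value :math:`x` in :math:`[x_{\min}, x_{\max}]` to a :math:`b`-bit integer :math:`q`; the notation here is illustrative and not tied to a specific NNI quantizer:

.. math::

    s = \frac{x_{\max} - x_{\min}}{2^{b} - 1}, \qquad
    q = \mathrm{round}\!\left(\frac{x - x_{\min}}{s}\right), \qquad
    \hat{x} = s \cdot q + x_{\min}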
A quantizer is the implementation of a quantization algorithm in NNI. NNI provides multiple quantizers, listed below. You can also
create your own quantizer using the NNI model compression interface.
.. toctree::
    :maxdepth: 2

    Quantizers <Quantizer>
@@ -267,5 +267,5 @@ The code can be referenced at `/examples/feature_engineering/gradient_feature_sel
* To [report a bug](https://github.com/microsoft/nni/issues/new?template=bug-report.md) for this feature in GitHub;
* To [file a feature or improvement request](https://github.com/microsoft/nni/issues/new?template=enhancement.md) for this feature in GitHub;
* To know more about [Neural Architecture Search with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/NAS/Overview.md);
* To know more about [Model Compression with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/Compressor/Overview.md);
* To know more about [Model Compression with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/Compression/Overview.md);
* To know more about [Hyperparameter Tuning with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/Tuner/BuiltinTuner.md);
@@ -63,9 +63,11 @@ NNI has support for many one-shot NAS algorithms such as ENAS and DARTS through
Other than one-shot NAS, NAS can also run in a classic mode where each candidate architecture runs as an independent trial job. In this mode, similar to hyperparameter tuning, users have to start an NNI experiment and choose a tuner for NAS.
### Model Compression
Model Compression on NNI includes pruning algorithms and quantization algorithms. These algorithms are provided through NNI trial SDK. Users can directly use them in their trial code and run the trial code without starting an NNI experiment. A detailed description of model compression and its usage can be found [here](Compressor/Overview.md).
NNI provides an easy-to-use model compression framework to compress deep neural networks; the compressed networks typically have a much smaller model size and much faster
inference speed without losing performance significantly. Model compression on NNI includes pruning algorithms and quantization algorithms, both provided
through the NNI trial SDK. Users can directly use them in their trial code and run the trial code without starting an NNI experiment. Users can also use the NNI model compression framework to customize their own pruning and quantization algorithms.
There are different types of hyperparameters in model compression. One type is the hyperparameters in input configuration (e.g., sparsity, quantization bits) to a compression algorithm. The other type is the hyperparameters in compression algorithms. Here, Hyperparameter tuning of NNI can help a lot in finding the best compressed model automatically. A simple example can be found [here](Compressor/AutoCompression.md).
A detailed description of model compression and its usage can be found [here](Compression/Overview.md).
### Automatic Feature Engineering
Automatic feature engineering is for users to find the best features for their tasks. A detailed description of automatic feature engineering and its usage can be found [here](FeatureEngineering/Overview.md). It is supported through NNI trial SDK, which means you do not have to create an NNI experiment. Instead, simply import a built-in auto-feature-engineering algorithm in your trial code and directly run your trial code.
@@ -85,5 +87,5 @@ The auto-feature-engineering algorithms usually have a bunch of hyperparameters
* [How to run an experiment on OpenPAI?](TrainingService/PaiMode.md)
* [Examples](TrialExample/MnistExamples.md)
* [Neural Architecture Search on NNI](NAS/Overview.md)
* [Automatic model compression on NNI](Compressor/Overview.md)
* [Model Compression on NNI](Compression/Overview.md)
* [Automatic feature engineering on NNI](FeatureEngineering/Overview.md)
Knowledge Distillation on NNI Compressor
Knowledge Distillation on NNI
===
## KnowledgeDistill
......
@@ -487,4 +487,4 @@ Note that, to use this tuner, your trial code should be modified accordingly, pl
* To [file a feature or improvement request](https://github.com/microsoft/nni/issues/new?template=enhancement.md) for this feature in GitHub;
* To know more about [Feature Engineering with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/FeatureEngineering/Overview.md);
* To know more about [NAS with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/NAS/Overview.md);
* To know more about [Model Compression with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/Compressor/Overview.md);
* To know more about [Model Compression with NNI](https://github.com/microsoft/nni/blob/master/docs/en_US/Compression/Overview.md);
@@ -2,7 +2,15 @@
Model Compression
#################
NNI provides an easy-to-use toolkit to help user design and use compression algorithms.
Deep neural networks (DNNs) have achieved great success in many tasks. However, typical neural networks are both
computationally expensive and energy intensive, and can be difficult to deploy on devices with low computation
resources or with strict latency requirements. Therefore, a natural thought is to perform model compression to
reduce model size and accelerate model training/inference without losing performance significantly. Model compression
techniques can be divided into two categories: pruning and quantization. Pruning methods explore the redundancy
in the model weights and try to remove/prune the redundant and uncritical weights. Quantization refers to compressing
models by reducing the number of bits required to represent weights or activations.
NNI provides an easy-to-use toolkit to help users design and use model pruning and quantization algorithms.
It supports TensorFlow and PyTorch with a unified interface.
To compress their models, users only need to add several lines to their code.
Some popular model compression algorithms are built into NNI.
@@ -15,12 +23,10 @@ For details, please refer to the following tutorials:
.. toctree::
    :maxdepth: 2

    Overview <Compressor/Overview>
    Quick Start <Compressor/QuickStart>
    Pruning <pruning>
    Quantizers <Compressor/Quantizer>
    Automatic Model Compression <Compressor/AutoCompression>
    Model Speedup <Compressor/ModelSpeedup>
    Compression Utilities <Compressor/CompressionUtils>
    Compression Framework <Compressor/Framework>
    Customize Compression Algorithms <Compressor/CustomizeCompressor>

    Overview <Compression/Overview>
    Quick Start <Compression/QuickStart>
    Pruning <Compression/pruning>
    Quantization <Compression/quantization>
    Utilities <Compression/CompressionUtils>
    Framework <Compression/Framework>
    Customize Model Compression Algorithms <Compression/CustomizeCompressor>
#################
Pruning
#################
NNI provides several pruning algorithms that support fine-grained weight pruning and structural filter pruning.
It supports TensorFlow and PyTorch with a unified interface.
To prune their models, users only need to add several lines to their code.
For structural filter pruning, NNI also provides a dependency-aware mode, in which the
filter pruner achieves a better speed gain after the speedup.
For details, please refer to the following tutorials:
.. toctree::
    :maxdepth: 2

    Pruners <Compressor/Pruner>
    Dependency Aware Mode <Compressor/DependencyAware>