The ``Conv2d`` op_type matches all ``torch.nn.Conv2d`` modules in the model; the special ``default`` op_type stands for the module types defined in :githublink:`default_layers.py <nni/compression/pytorch/default_layers.py>` for PyTorch.
Therefore ``{ 'sparsity': 0.5, 'op_types': ['Conv2d'] }``\ means that **all layers with specified op_types will be compressed with the same 0.5 sparsity**. When ``pruner.compress()`` is called, the model is compressed with masks; after that you can fine-tune the model as usual, and the **pruned weights, which have been masked, will not be updated**.
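As a concrete illustration, the following sketch builds such a config list and applies a one-shot filter pruner. It is a minimal sketch: the toy model, the choice of ``L2FilterPruner``, and the exported file names are placeholders rather than part of the tutorial's own example.

.. code-block:: python

   import torch
   import torch.nn as nn
   from nni.algorithms.compression.pytorch.pruning import L2FilterPruner

   # toy model with two convolution layers (placeholder for your own network)
   model = nn.Sequential(
       nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
       nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
       nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
   )

   # prune every Conv2d layer to 50% sparsity
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

   pruner = L2FilterPruner(model, config_list)
   pruner.compress()  # computes the masks and wraps the layers

   # fine-tune as usual here; the masked weights stay zero
   pruner.export_model(model_path='pruned.pth', mask_path='mask.pth')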
Then, make this automatic
-------------------------
The previous example manually chose L2FilterPruner and pruned with a specified sparsity. Different sparsities and different pruners may have different effects on different models, so choosing them by hand is often sub-optimal. This tuning process can be automated with NNI tuners.
The first thing we need to do is to design a search space. Here we use a nested search space that covers both the choice of pruning algorithm and the per-layer sparsities:
.. code-block:: json

   {
       "prune_method": {
           "_type": "choice",
           "_value": [
               {
                   "_name": "agp",
                   "conv0_sparsity": { "_type": "uniform", "_value": [0.1, 0.9] },
                   "conv1_sparsity": { "_type": "uniform", "_value": [0.1, 0.9] }
               },
               {
                   "_name": "level",
                   "conv0_sparsity": { "_type": "uniform", "_value": [0.1, 0.9] },
                   "conv1_sparsity": { "_type": "uniform", "_value": [0.01, 0.9] }
               }
           ]
       }
   }
Then we need to modify our code with a few lines:
.. code-block:: python

   import nni
   from nni.algorithms.compression.pytorch.pruning import *
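Below is a rough sketch of how the sampled parameters might then be consumed; the layer names ``conv0``/``conv1``, the use of ``op_names``, and the final pruner choice are illustrative assumptions rather than the tutorial's exact code.

.. code-block:: python

   # get one point from the search space defined above
   params = nni.get_next_parameter()
   prune_method = params['prune_method']

   config_list = [
       {'sparsity': prune_method['conv0_sparsity'], 'op_names': ['conv0']},
       {'sparsity': prune_method['conv1_sparsity'], 'op_names': ['conv1']},
   ]

   # prune_method['_name'] is either 'agp' or 'level'; pick the pruner accordingly,
   # e.g. LevelPruner(model, config_list) for the one-shot case.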
We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Fine-grained Pruning** generally results in unstructured models, which need specialized hardware or software to speed up the sparse network. **Filter Pruning** achieves acceleration by removing entire filters. Some pruning algorithms are one-shot: they prune weights once, based on an importance metric. Others control the **pruning schedule**, pruning weights during optimization; this includes several automatic pruning algorithms.
**Fine-grained Pruning**

* `Level Pruner <#level-pruner>`__

**Filter Pruning**

* `Slim Pruner <#slim-pruner>`__
* `FPGM Pruner <#fpgm-pruner>`__
* `L1Filter Pruner <#l1filter-pruner>`__
...
This is a one-shot pruner proposed in `'Learning Efficient Convolutional Networks through Network Slimming' <https://arxiv.org/pdf/1708.06519.pdf>`__ by Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan and Changshui Zhang. It adds sparsity regularization on the scaling factors of batch normalization (BN) layers during training to identify unimportant channels; the channels with small scaling factor values will be pruned.
.. image:: ../../img/slim_pruner.png
   :target: ../../img/slim_pruner.png
   :alt:

..

   Slim Pruner **prunes channels in the convolution layers by masking the corresponding scaling factors in the later BN layers**. L1 regularization is applied to these scaling factors while training, and the scaling factors of the BN layers are **globally ranked** while pruning, so the sparse model can be found automatically for a given sparsity.
Usage
^^^^^
...
     - Parameters
     - Pruned
   * - VGGNet
     - 6.34/6.69
     - 20.04M
     -
   * - Pruned-VGGNet
     - 6.20/6.34
     - 2.03M
     - 88.5%
The experiments code can be found at :githublink:`examples/model_compress/pruning/basic_pruners_torch.py <examples/model_compress/pruning/basic_pruners_torch.py>`
----

FPGM Pruner
-----------

This is a one-shot pruner that prunes filters with the smallest geometric median; FPGM chooses the filters with the most replaceable contribution. It is an implementation of the paper `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/pdf/1811.00250.pdf>`__.

..

   Previous works utilized “smaller-norm-less-important” criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm deviation of the filters should be large; (2) the minimum norm of the filters should be small. To solve this problem, we propose a novel filter pruning method, namely Filter Pruning via Geometric Median (FPGM), to compress the model regardless of those two requirements. Unlike previous methods, FPGM compresses CNN models by pruning filters with redundancy, rather than those with “relatively less” importance.

We also provide a dependency-aware mode for this pruner to get better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
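A minimal sketch of the dependency-aware mode is shown below; the model, input shape, and sparsity are placeholders, and it assumes the pruner accepts the ``dependency_aware`` and ``dummy_input`` keyword arguments described in the dependency-aware guide.

.. code-block:: python

   import torch
   from torchvision.models import resnet18
   from nni.algorithms.compression.pytorch.pruning import FPGMPruner

   model = resnet18(pretrained=False)  # placeholder model with channel dependencies
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

   # dummy_input lets the pruner trace the graph to find the channel dependencies
   dummy_input = torch.rand(1, 3, 224, 224)
   pruner = FPGMPruner(model, config_list, dependency_aware=True, dummy_input=dummy_input)
   pruner.compress()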
...
L1Filter Pruner
---------------
This is a one-shot pruner that prunes filters in the **convolution layers**. It was proposed in `PRUNING FILTERS FOR EFFICIENT CONVNETS <https://arxiv.org/abs/1608.08710>`__ by Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet and Hans Peter Graf.
.. image:: ../../img/l1filter_pruner.png
   :target: ../../img/l1filter_pruner.png
   :alt:

The procedure of pruning :math:`m` filters from the :math:`i`-th convolutional layer is as follows:
#. For each filter :math:`F_{i,j}`, calculate the sum of its absolute kernel weights :math:`s_j=\sum_{l=1}^{n_i}\sum|K_l|`.
#. Sort the filters by :math:`s_j`.
...
#. A new kernel matrix is created for both the :math:`i`-th and :math:`i+1`-th layers, and the remaining kernel
   weights are copied to the new model.
For more details, please refer to `PRUNING FILTERS FOR EFFICIENT CONVNETS <https://arxiv.org/abs/1608.08710>`__\.
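To make the ranking step concrete, here is a small standalone sketch (independent of NNI) that computes the per-filter score :math:`s_j` for one convolution layer and selects the filters that would be kept; the layer shape and sparsity are arbitrary.

.. code-block:: python

   import torch
   import torch.nn as nn

   conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)
   sparsity = 0.5
   num_prune = int(conv.out_channels * sparsity)

   # s_j: sum of absolute kernel weights of each filter (one score per output channel)
   scores = conv.weight.data.abs().sum(dim=(1, 2, 3))

   # the filters with the smallest scores are pruned, the rest are kept
   kept = torch.argsort(scores, descending=True)[:conv.out_channels - num_prune]
   print(sorted(kept.tolist()))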
In addition, we also provide a dependency-aware mode for the L1FilterPruner. For more details about the dependency-aware mode, please refer to `dependency-aware mode <./DependencyAware.rst>`__.
...
- 64.0%
The experiments code can be found at :githublink:`examples/model_compress/pruning/basic_pruners_torch.py <examples/model_compress/pruning/basic_pruners_torch.py>`
We also provide a dependency-aware mode for this pruner to get better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
...
Note: ActivationAPoZRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
You can view :githublink:`example <examples/model_compress/pruning/basic_pruners_torch.py>` for more information.
User configuration for ActivationAPoZRankFilter Pruner
Note: ActivationMeanRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
You can view :githublink:`example <examples/model_compress/pruning/basic_pruners_torch.py>` for more information.
User configuration for ActivationMeanRankFilterPruner
We also provide a dependency-aware mode for this pruner to get better speedup from the pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.
...
AGP Pruner
----------
This is an iterative pruner proposed by Michael Zhu and Suyog Gupta in `To prune, or not to prune: exploring the efficacy of pruning for model compression <https://arxiv.org/abs/1710.01878>`__, which prunes the weights gradually: the sparsity is increased from an initial sparsity value :math:`s_i` (usually 0) to a final sparsity value :math:`s_f` over a span of :math:`n` pruning steps, starting at training step :math:`t_{0}` and with pruning frequency :math:`\Delta t`:

:math:`s_{t}=s_{f}+\left(s_{i}-s_{f}\right)\left(1-\frac{t-t_{0}}{n \Delta t}\right)^{3} \text { for } t \in\left\{t_{0}, t_{0}+\Delta t, \ldots, t_{0} + n \Delta t\right\}`
The binary weight masks are updated every :math:`\Delta t` steps as the network is trained, to gradually increase the sparsity of the network while allowing the training steps to recover from any pruning-induced loss in accuracy. In our experience, varying the pruning frequency :math:`\Delta t` between 100 and 1000 training steps had a negligible impact on the final model quality. Once the model achieves the target sparsity :math:`s_f`, the weight masks are no longer updated. The intuition behind this sparsity function is to prune the network rapidly in the initial phase when redundant connections are abundant and to gradually reduce the number of weights being pruned as fewer weights remain.

For more details, please refer to `To prune, or not to prune: exploring the efficacy of pruning for model compression <https://arxiv.org/abs/1710.01878>`__\.
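The schedule itself is easy to evaluate directly; the short standalone snippet below computes it for an illustrative setting (initial sparsity 0, final sparsity 0.8, 10 pruning steps), matching the usage example that follows.

.. code-block:: python

   # evaluate the AGP sparsity schedule s_t for a toy configuration
   s_i, s_f = 0.0, 0.8        # initial and final sparsity
   n, t0, delta_t = 10, 0, 1  # number of pruning steps, start step, pruning frequency

   for k in range(n + 1):
       t = t0 + k * delta_t
       s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * delta_t)) ** 3
       print(f"step {t}: sparsity {s_t:.3f}")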
Usage
^^^^^
You can prune all weights from 0% to 80% sparsity in 10 epochs with the code below.
PyTorch code
...
pruner.update_epoch(epoch)
You can view :githublink:`example <examples/model_compress/pruning/basic_pruners_torch.py>` for more information.
User configuration for AGP Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
For more details, please refer to `NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications <https://arxiv.org/abs/1804.03230>`__.
better preserving the accuracy and freeing human labor.
.. image:: ../../img/amc_pruner.jpg
   :target: ../../img/amc_pruner.jpg
   :alt:
For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__.
Usage
...
The above configuration means that there are 5 pruning iterations. As the 5 iterations are executed in the same run, LotteryTicketPruner needs ``model`` and ``optimizer`` (\ **note that ``lr_scheduler`` should also be included if used**\ ) to reset their states every time a new prune iteration starts. Please use ``get_prune_iterations`` to get the pruning iterations, and invoke ``prune_iteration_start`` at the beginning of each iteration. ``epoch_num`` should be large enough for model convergence, because the hypothesis is that the performance (accuracy) obtained in later rounds with high sparsity can be comparable with that obtained in the first round. A rough sketch of this loop structure is shown below.
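In the sketch, the toy model, sparsity values, and the placeholder training step are assumptions for illustration; only the loop structure follows the description above.

.. code-block:: python

   import torch
   import torch.nn as nn
   from nni.algorithms.compression.pytorch.pruning import LotteryTicketPruner

   model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
   optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

   config_list = [{'prune_iterations': 5, 'sparsity': 0.8, 'op_types': ['default']}]
   pruner = LotteryTicketPruner(model, config_list, optimizer)
   pruner.compress()

   for i in pruner.get_prune_iterations():
       pruner.prune_iteration_start()   # rewind weights and update the masks for this round
       for epoch in range(50):          # train long enough for convergence in each round
           pass                         # placeholder: run your usual training step(s) here
       # evaluate and record the accuracy of this round here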
*Tensorflow version will be supported later.*
User configuration for LotteryTicket Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^
We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found :githublink:`here <examples/model_compress/pruning/lottery_torch_mnist_fc.py>`. In this experiment, we prune the model 10 times; after each pruning, we train the pruned model for 50 epochs.
You can use other compression algorithms in the package of ``nni.compression``. The algorithms are implemented in both PyTorch and TensorFlow (partial support on TensorFlow), under ``nni.compression.pytorch`` and ``nni.compression.tensorflow`` respectively. You can refer to `Pruner <./Pruner.rst>`__ and `Quantizer <./Quantizer.rst>`__ for a detailed description of the supported algorithms. Also, if you want to use knowledge distillation, you can refer to `KDExample <../TrialExample/KDExample.rst>`__.
A compression algorithm is first instantiated with a ``config_list`` passed in. The specification of this ``config_list`` will be described later.
Knowledge Distillation (KD) was proposed in `Distilling the Knowledge in a Neural Network <https://arxiv.org/abs/1503.02531>`__\ : the compressed model is trained to mimic a pre-trained, larger model. This training setting is also referred to as "teacher-student", where the large model is the teacher and the small model is the student. KD is often used to fine-tune the pruned model.
.. image:: ../../img/distill.png
   :target: ../../img/distill.png
   :alt:
Usage
^^^^^
...
.. code-block:: python

   # model_s is the (pruned) student model being fine-tuned, model_t the pre-trained teacher
   for batch_idx, (data, target) in enumerate(train_loader):
       data, target = data.to(device), target.to(device)
       optimizer.zero_grad()
       y_s = model_s(data)
       y_t = model_t(data)
       loss_cri = F.cross_entropy(y_s, target)
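The distillation term is then added on top of the hard-label loss. A common soft-target formulation, continuing inside the training loop above, is sketched below; the temperature ``kd_T`` and the weights ``alpha``/``beta`` are illustrative values rather than prescribed ones.

.. code-block:: python

       # still inside the training loop above
       kd_T, alpha, beta = 5.0, 1.0, 0.8   # temperature and loss weights (illustrative)

       # soften both distributions with the temperature, then match them with KL divergence
       p_s = F.log_softmax(y_s / kd_T, dim=1)
       p_t = F.softmax(y_t / kd_T, dim=1)
       loss_kd = F.kl_div(p_s, p_t, reduction='batchmean') * (kd_T ** 2)

       loss = alpha * loss_cri + beta * loss_kd
       loss.backward()
       optimizer.step()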
Note: to fine-tune a pruned model, run :githublink:`basic_pruners_torch.py <examples/model_compress/pruning/basic_pruners_torch.py>` first to get the mask file, then pass the mask path as an argument to the script.
* **model_t** (teacher): the pre-trained, larger model
* **kd_T**: temperature for smoothing the teacher model's output
The complete code can be found `here <https://github.com/microsoft/nni/tree/v1.3/examples/model_compress/knowledge_distill/>`__
You can run these examples easily like this, taking torch pruning for example:
```bash
python model_prune_torch.py
```
This example uses AGP Pruner. Initiating a pruner needs a user-provided configuration, which can be provided in two ways:

- By reading `configure_example.yaml`; this can keep the code clean when your configuration is complicated
- Directly configuring it in your code
In our example, we simply configure model compression in our code like this:
```python
config_list = [{
    'initial_sparsity': 0,
    'final_sparsity': 0.8,
    'start_epoch': 0,
    'end_epoch': 10,
    'frequency': 1,
    'op_types': ['default']
}]
pruner = AGPPruner(config_list)
```
When `pruner(model)` is called, your model is injected with masks as embedded operations. For example, a layer takes a weight as input; we insert an operation between the weight and the layer that takes the weight as input and outputs a new weight with the mask applied. Thus, the masks are applied whenever the computation goes through the operations. You can fine-tune your model **without** any modifications.
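Conceptually, the inserted operation is nothing more than an element-wise multiplication of the weight with its binary mask before the layer uses it; the shapes below are arbitrary:

```python
import torch

weight = torch.randn(8, 4)                 # a layer's weight
mask = (torch.rand(8, 4) > 0.8).float()    # binary mask kept by the pruner

masked_weight = weight * mask              # what the layer actually computes with
```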
```python
for epoch in range(10):
    # update_epoch is for the pruner to be aware of epochs, so that it can adjust masks during training
    pruner.update_epoch(epoch)
    print('# Epoch {} #'.format(epoch))
    train(model, device, train_loader, optimizer)
    test(model, device, test_loader)
```
When fine-tuning is finished, the pruned weights are all masked and you can export the masks like this:
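A typical way to save both the masked weights and the mask file is `export_model`; the file names below are placeholders:

```python
pruner.export_model(model_path='pruned_model.pth', mask_path='mask.pth')
```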
Example for supported automatic pruning algorithms.
In this example, we present the usage of automatic pruners (NetAdapt, AutoCompressPruner). L1, L2, and FPGM pruners are also executed for comparison purposes.
NNI example for fine-tuning the pruned model with KD.
Run basic_pruners_torch.py first to get the masks of the pruned model, then pass the mask file as an argument for model speedup. The compressed model is then used for fine-tuning.
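A sketch of the speedup step is shown below; the network definition, mask file path, and input shape are placeholders for whatever `basic_pruners_torch.py` produced in your run.

```python
import torch
from nni.compression.pytorch import ModelSpeedup

device = torch.device('cpu')
# model: the same network definition that was pruned by basic_pruners_torch.py
dummy_input = torch.rand(1, 3, 32, 32).to(device)   # placeholder input shape
m_speedup = ModelSpeedup(model, dummy_input, 'mask.pth')
m_speedup.speedup_model()
# the physically smaller model can now be fine-tuned (optionally with KD)
```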