"vscode:/vscode.git/clone" did not exist on "3ffd105237464bca63d02019bb5e4a26c8144a56"
Commit e773dfcc authored by qianyj's avatar qianyj
Browse files

create branch for v2.9

parents
Best Practices
==============
.. toctree::
   :hidden:
   :maxdepth: 2

   Pruning Transformer </tutorials/pruning_bert_glue>
Compression Config Specification
================================
Each sub-config in the config list is a dict, and each setting (key) applies only within that sub-config.
If multiple sub-configs are configured for the same layer, the later ones will overwrite the earlier ones.
Common Keys in Config
---------------------
op_types
^^^^^^^^
The type of the layers targeted by this sub-config.
If ``op_names`` is not set in this sub-config, all layers in the model that satisfy the type will be selected.
If ``op_names`` is set in this sub-config, the selected layers should satisfy both type and name.
op_names
^^^^^^^^
The name of the layers targeted by this sub-config.
If ``op_types`` is set in this sub-config, the selected layers should satisfy both type and name.
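
For example, the following sub-config is a minimal sketch (the layer names are hypothetical) in which a layer is selected only if it is a ``Conv2d`` module and its name appears in ``op_names``:

.. code-block:: python

    config_list = [{
        'op_types': ['Conv2d'],
        'op_names': ['conv1', 'conv2'],  # hypothetical layer names; both type and name must match
        'sparsity_per_layer': 0.5        # pruning setting, explained below
    }]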
exclude
^^^^^^^
The ``exclude`` and ``sparsity`` keywords are mutually exclusive and cannot exist in the same sub-config.
If ``exclude`` is set in a sub-config, the layers selected by this sub-config will not be compressed.
Special Keys for Pruning
------------------------
op_partial_names
^^^^^^^^^^^^^^^^
This key may be shared with the quantization config in the future.
This key selects the layers to be pruned by a common name sub-string. NNI will traverse all module names in the model,
find the names that contain one of the ``op_partial_names``, and append them to ``op_names``.
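
As an illustration, a minimal sketch of ``op_partial_names`` (assuming, hypothetically, that the model contains several ``Linear`` modules whose names all contain the sub-string ``attention``):

.. code-block:: python

    config_list = [{
        'op_types': ['Linear'],
        'op_partial_names': ['attention'],  # every Linear layer whose name contains 'attention' is appended to op_names
        'sparsity_per_layer': 0.5
    }]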
sparsity_per_layer
^^^^^^^^^^^^^^^^^^
The sparsity ratio of each selected layer.
E.g., a ``sparsity_per_layer`` of 0.8 means each selected layer will have 80% of its weight values masked.
If ``layer_1`` (500 parameters) and ``layer_2`` (1000 parameters) are selected in this sub-config,
then 400 parameters of ``layer_1`` and 800 parameters of ``layer_2`` will be masked.
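
A minimal sketch of this key in a sub-config, using the hypothetical layer names from the example above:

.. code-block:: python

    config_list = [{
        'op_names': ['layer_1', 'layer_2'],
        'sparsity_per_layer': 0.8  # each selected layer masks 80% of its own weights
    }]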
total_sparsity
^^^^^^^^^^^^^^
The overall sparsity ratio of all selected layers, which means the sparsity ratio may no longer be evenly distributed between layers.
E.g., a ``total_sparsity`` of 0.8 means 80% of the parameters in this sub-config will be masked.
If ``layer_1`` (500 parameters) and ``layer_2`` (1000 parameters) are selected in this sub-config,
then ``layer_1`` and ``layer_2`` will have a total of 1200 parameters masked;
how these masked parameters are distributed between the two layers is determined by the pruning algorithm.
sparsity
^^^^^^^^
``sparsity`` is a legacy config key from pruning v1; it has the same meaning as ``sparsity_per_layer``.
You can still use ``sparsity`` for now, but it will be deprecated in the future.
max_sparsity_per_layer
^^^^^^^^^^^^^^^^^^^^^^
This key is usually used together with ``total_sparsity``. It limits the maximum sparsity ratio of each layer.
In the ``total_sparsity`` example above, 1200 parameters need to be masked, so all parameters in ``layer_1`` might be completely masked.
To avoid this situation, ``max_sparsity_per_layer`` can be set to 0.9, which means at most 450 parameters can be masked in ``layer_1``
and at most 900 parameters can be masked in ``layer_2``.
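
Putting the two keys together, a minimal sketch (again with the hypothetical layer names from above):

.. code-block:: python

    config_list = [{
        'op_names': ['layer_1', 'layer_2'],
        'total_sparsity': 0.8,           # mask 80% of all parameters selected by this sub-config
        'max_sparsity_per_layer': 0.9    # but never mask more than 90% of any single layer
    }]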
Special Keys for Quantization
-----------------------------
quant_types
^^^^^^^^^^^
Currently, NNI supports three kinds of quantization types: 'weight', 'input', 'output'.
It can be set as a ``str`` or a ``List[str]``.
Note that 'weight' and 'input' are always quantized together, e.g., ``['input', 'weight']``.
quant_bits
^^^^^^^^^^
Bit width of quantization. The key is a quantization type set in ``quant_types`` and the value is the bit width,
e.g., ``{'weight': 8}``. When the value is an ``int``, all quantization types share the same bit width.
quant_start_step
^^^^^^^^^^^^^^^^
Specific key for ``QAT Quantizer``. Disable quantization until the model has been run for a certain number of steps.
This allows the network to enter a more stable state, where the output quantization ranges do not exclude a significant fraction of values.
The default value is 0.
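
For illustration, a minimal sketch of a sub-config for ``QAT Quantizer`` that combines these keys (the bit widths and the step count are arbitrary example values):

.. code-block:: python

    config_list = [{
        'op_types': ['Conv2d'],
        'quant_types': ['input', 'weight'],
        'quant_bits': {'input': 8, 'weight': 8},
        'quant_start_step': 1000  # keep full precision for the first 1000 training steps
    }]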
Examples
--------
Suppose we want to compress the following model::

    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout2d(0.25)
            self.dropout2 = nn.Dropout2d(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            ...
First, we need to determine where to compress. Use the following config list to specify all ``Conv2d`` modules and the module named ``fc1``::

    config_list = [{'op_types': ['Conv2d']}, {'op_names': ['fc1']}]
Sometimes we may need to compress all modules of a certain type, except for a few special ones.
Writing out all the module names would be laborious in this case; instead, we can use ``exclude`` to quickly specify the compression target modules::

    config_list = [{
        'op_types': ['Conv2d', 'Linear']
    }, {
        'exclude': True,
        'op_names': ['fc2']
    }]
For the model we want to compress, the above two config lists are equivalent: they both select ``conv1``, ``conv2``, and ``fc1`` as compression targets.
Let's take a simple pruning config list example, pruning all ``Conv2d`` modules with 50% sparsity and pruning ``fc1`` with 80% sparsity::

    config_list = [{
        'op_types': ['Conv2d'],
        'total_sparsity': 0.5
    }, {
        'op_names': ['fc1'],
        'total_sparsity': 0.8
    }]
Then if you want to try model quantization, here is a simple config list example::

    config_list = [{
        'op_types': ['Conv2d'],
        'quant_types': ['input', 'weight'],
        'quant_bits': {'input': 8, 'weight': 8}
    }, {
        'op_names': ['fc1'],
        'quant_types': ['input', 'weight'],
        'quant_bits': {'input': 8, 'weight': 8}
    }]
Compression Evaluator
=====================
The ``Evaluator`` is used to package the training and evaluation process for a targeted model.
To explain why NNI needs an ``Evaluator``, let's first look at the general process of model compression in NNI.
In model pruning, some algorithms need to prune according to intermediate variables (gradients, activations, etc.) generated during the training process,
some algorithms need to gradually increase or adjust the sparsity of different layers during training,
and some need to adjust the pruning strategy according to the performance changes of the model during pruning.
In model quantization, NNI has quantization-aware training algorithms
that can adjust the scale and zero point required for model quantization from time to time during training,
and may achieve better performance compared to post-training quantization.
In order to better support the above algorithms' needs and maintain the consistency of the interface,
NNI introduces the ``Evaluator`` as the carrier of the training and evaluation process.
.. note::
    For users prior to NNI v2.8: NNI previously provided APIs like ``trainer``, ``traced_optimizer``, ``criterion``, ``finetuner``.
    These APIs could be tedious in terms of user experience: users had to swap the corresponding APIs frequently when switching compression algorithms.
    ``Evaluator`` is an alternative to the above interfaces; users only need to create the evaluator once and it can be used in all compressors.

    For users of native PyTorch, :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` requires the user to encapsulate the training process as a function exposing a specified interface,
    which brings some complexity. But don't worry, in most cases this will not change too much code.

    For users of `PyTorchLightning <https://www.pytorchlightning.ai/>`__, :class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>` can be created with only a few lines of code based on your original Lightning code.
Here we give two examples of how to create an ``Evaluator`` for both native PyTorch and PyTorchLightning users.
TorchEvaluator
--------------
:class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` is for users who work in a native PyTorch environment (if you are using PyTorchLightning, please refer to `LightningEvaluator`_).

:class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` has six initialization parameters: ``training_func``, ``optimizers``, ``criterion``, ``lr_schedulers``,
``dummy_input``, and ``evaluating_func``.

* ``training_func`` is the training loop to train the compressed model.
  It is a callable function with six input parameters: ``model``, ``optimizers``,
  ``criterion``, ``lr_schedulers``, ``max_steps``, ``max_epochs``.
  Please make sure each input argument of ``training_func`` is actually used,
  and especially that ``max_steps`` and ``max_epochs`` correctly control the duration of training.
* ``optimizers`` is a single traced optimizer or a list of traced optimizers;
  please make sure to wrap the ``Optimizer`` class with ``nni.trace`` before initializing it / them.
* ``criterion`` is a callable function to compute the loss; it has two input parameters, ``input`` and ``target``, and returns a tensor as the loss.
* ``lr_schedulers`` is a single traced scheduler or a list of traced schedulers; as with ``optimizers``,
  please make sure to wrap the ``_LRScheduler`` class with ``nni.trace`` before initializing it / them.
* ``dummy_input`` is used to trace the model, same as ``example_inputs``
  in `torch.jit.trace <https://pytorch.org/docs/stable/generated/torch.jit.trace.html?highlight=torch%20jit%20trace#torch.jit.trace>`_.
* ``evaluating_func`` is a callable function to evaluate the compressed model's performance. Its input is a compressed model and its output is the metric.
  The metric should be a float number or a dict with the key ``default``.

Please refer to :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` for more details.
Here is an example of how to initialize a :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>`.
.. code-block:: python

    from __future__ import annotations

    from typing import Callable, Any

    import torch
    from torch.optim.lr_scheduler import StepLR, _LRScheduler
    from torch.utils.data import DataLoader
    from torchvision import datasets, models

    import nni
    from nni.algorithms.compression.v2.pytorch import TorchEvaluator

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


    def training_func(model: torch.nn.Module, optimizers: torch.optim.Optimizer,
                      criterion: Callable[[Any, Any], torch.Tensor],
                      lr_schedulers: _LRScheduler | None = None, max_steps: int | None = None,
                      max_epochs: int | None = None, *args, **kwargs):
        model.train()

        # prepare data
        imagenet_train_data = datasets.ImageNet(root='data/imagenet', split='train', download=True)
        train_dataloader = DataLoader(imagenet_train_data, batch_size=4, shuffle=True)

        #############################################################################
        # NNI may change the training duration by setting max_steps or max_epochs.
        # To ensure that NNI has the ability to control the training duration,
        # please add max_steps and max_epochs as constraints to the training loop.
        #############################################################################
        total_epochs = max_epochs if max_epochs else 20
        total_steps = max_steps if max_steps else 1000000
        current_steps = 0

        # training loop
        for _ in range(total_epochs):
            for inputs, labels in train_dataloader:
                inputs, labels = inputs.to(device), labels.to(device)
                optimizers.zero_grad()
                loss = criterion(model(inputs), labels)
                loss.backward()
                optimizers.step()
                ######################################################################
                # stop the training loop when reaching total_steps
                ######################################################################
                current_steps += 1
                if total_steps and current_steps == total_steps:
                    return
            lr_schedulers.step()


    def evaluating_func(model: torch.nn.Module):
        model.eval()

        # prepare data
        imagenet_val_data = datasets.ImageNet(root='./data/imagenet', split='val', download=True)
        val_dataloader = DataLoader(imagenet_val_data, batch_size=4, shuffle=False)

        # testing loop
        correct = 0
        with torch.no_grad():
            for inputs, labels in val_dataloader:
                inputs, labels = inputs.to(device), labels.to(device)
                logits = model(inputs)
                preds = torch.argmax(logits, dim=1)
                correct += preds.eq(labels.view_as(preds)).sum().item()
        return correct / len(imagenet_val_data)


    # initialize the optimizer, criterion, lr_scheduler, dummy_input
    model = models.resnet18().to(device)

    ######################################################################
    # please use nni.trace to wrap the optimizer class,
    # NNI will use the trace information to re-initialize the optimizer
    ######################################################################
    optimizer = nni.trace(torch.optim.Adam)(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()

    ######################################################################
    # please use nni.trace to wrap the lr_scheduler class,
    # NNI will use the trace information to re-initialize the lr_scheduler
    ######################################################################
    lr_scheduler = nni.trace(StepLR)(optimizer, step_size=5, gamma=0.1)
    dummy_input = torch.rand(4, 3, 224, 224).to(device)

    # TorchEvaluator initialization
    evaluator = TorchEvaluator(training_func=training_func, optimizers=optimizer, criterion=criterion,
                               lr_schedulers=lr_scheduler, dummy_input=dummy_input, evaluating_func=evaluating_func)
.. note::
    It is also worth noting that not all the arguments of :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` must be provided.
    Some compressors only require ``evaluating_func`` because they do not train the model, and some compressors only require ``training_func``.
    Please refer to each compressor's doc to check the required arguments.
    But it is fine to provide more arguments than the compressor needs.
A complete example of a pruner using :class:`TorchEvaluator <nni.compression.pytorch.TorchEvaluator>` to compress a model can be found :githublink:`here <examples/model_compress/pruning/taylorfo_torch_evaluator.py>`.
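
As a rough sketch of how the evaluator is consumed, an evaluator-based pruner is typically constructed with the model, the config list, and the evaluator. The pruner name and keyword arguments below are assumptions for illustration; please check the linked example and the pruner's documentation for the exact signature:

.. code-block:: python

    # hypothetical usage sketch, not a verified signature
    from nni.compression.pytorch.pruning import TaylorFOWeightPruner

    config_list = [{'op_types': ['Conv2d'], 'total_sparsity': 0.5}]
    pruner = TaylorFOWeightPruner(model, config_list, evaluator, training_steps=100)
    _, masks = pruner.compress()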
LightningEvaluator
------------------
:class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>` is for users who work with PyTorchLightning.
Only three parts need to be modified compared with the original PyTorchLightning code:

1. Wrap the ``Optimizer`` and ``_LRScheduler`` classes with ``nni.trace``.
2. Wrap the ``LightningModule`` class with ``nni.trace``.
3. Wrap the ``LightningDataModule`` class with ``nni.trace``.

Please refer to :class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>` for more details.
Here is an example of how to initialize a :class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>`.
.. code-block:: python

    from __future__ import annotations

    import pytorch_lightning as pl
    from pytorch_lightning.loggers import TensorBoardLogger
    import torch
    from torch.optim.lr_scheduler import StepLR
    from torch.utils.data import DataLoader
    from torchmetrics.functional import accuracy
    from torchvision import datasets, models

    import nni
    from nni.algorithms.compression.v2.pytorch import LightningEvaluator


    class SimpleLightningModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.model = models.resnet18()
            self.criterion = torch.nn.CrossEntropyLoss()

        def forward(self, x):
            return self.model(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self(x)
            loss = self.criterion(logits, y)
            self.log("train_loss", loss)
            return loss

        def evaluate(self, batch, stage=None):
            x, y = batch
            logits = self(x)
            loss = self.criterion(logits, y)
            preds = torch.argmax(logits, dim=1)
            acc = accuracy(preds, y)

            if stage:
                self.log("default", loss, prog_bar=False)
                self.log(f"{stage}_loss", loss, prog_bar=True)
                self.log(f"{stage}_acc", acc, prog_bar=True)

        def validation_step(self, batch, batch_idx):
            self.evaluate(batch, "val")

        def test_step(self, batch, batch_idx):
            self.evaluate(batch, "test")

        #####################################################################
        # please pay attention to this function,
        # use nni.trace to trace the optimizer and lr_scheduler class.
        #####################################################################
        def configure_optimizers(self):
            optimizer = nni.trace(torch.optim.SGD)(
                self.parameters(),
                lr=0.01,
                momentum=0.9,
                weight_decay=5e-4,
            )
            scheduler_dict = {
                "scheduler": nni.trace(StepLR)(
                    optimizer,
                    step_size=5,
                    gamma=0.1
                ),
                "interval": "epoch",
            }
            return {"optimizer": optimizer, "lr_scheduler": scheduler_dict}


    class ImageNetDataModule(pl.LightningDataModule):
        def __init__(self, data_dir: str = "./data/imagenet"):
            super().__init__()
            self.data_dir = data_dir

        def prepare_data(self):
            # download
            datasets.ImageNet(self.data_dir, split='train', download=True)
            datasets.ImageNet(self.data_dir, split='val', download=True)

        def setup(self, stage: str | None = None):
            if stage == "fit" or stage is None:
                self.imagenet_train_data = datasets.ImageNet(root='data/imagenet', split='train')
                self.imagenet_val_data = datasets.ImageNet(root='./data/imagenet', split='val')

            if stage == "test" or stage is None:
                self.imagenet_test_data = datasets.ImageNet(root='./data/imagenet', split='val')

            if stage == "predict" or stage is None:
                self.imagenet_predict_data = datasets.ImageNet(root='./data/imagenet', split='val')

        def train_dataloader(self):
            return DataLoader(self.imagenet_train_data, batch_size=4)

        def val_dataloader(self):
            return DataLoader(self.imagenet_val_data, batch_size=4)

        def test_dataloader(self):
            return DataLoader(self.imagenet_test_data, batch_size=4)

        def predict_dataloader(self):
            return DataLoader(self.imagenet_predict_data, batch_size=4)


    #####################################################################
    # please use nni.trace to wrap the pl.Trainer class,
    # NNI will use the trace information to re-initialize the trainer
    #####################################################################
    pl_trainer = nni.trace(pl.Trainer)(
        accelerator='auto',
        devices=1,
        max_epochs=1,
        max_steps=50,
        logger=TensorBoardLogger('./lightning_logs', name="resnet"),
    )

    #####################################################################
    # please use nni.trace to wrap the pl.LightningDataModule class,
    # NNI will use the trace information to re-initialize the datamodule
    #####################################################################
    pl_data = nni.trace(ImageNetDataModule)(data_dir='./data/imagenet')

    evaluator = LightningEvaluator(pl_trainer, pl_data)
.. note::
    In ``LightningModule.configure_optimizers``, users should use the traced ``torch.optim.Optimizer`` and traced ``torch.optim._LRScheduler``.
    This is so that NNI can get the initialization parameters of the optimizers and lr_schedulers.

    .. code-block:: python

        class SimpleModel(pl.LightningModule):
            ...

            def configure_optimizers(self):
                optimizers = nni.trace(torch.optim.SGD)(self.parameters(), lr=0.001)
                lr_schedulers = nni.trace(ExponentialLR)(optimizer=optimizers, gamma=0.1)
                return optimizers, lr_schedulers
A complete example of a pruner using :class:`LightningEvaluator <nni.compression.pytorch.LightningEvaluator>` to compress a model can be found :githublink:`here <examples/model_compress/pruning/taylorfo_lightning_evaluator.py>`.
Analysis Utils for Model Compression
====================================
We provide several easy-to-use tools for users to analyze their model during model compression.
Sensitivity Analysis
--------------------
First, we provide a sensitivity analysis tool (**SensitivityAnalysis**) for users to analyze the sensitivity of each convolutional layer in their model. Specifically, SensitivityAnalysis gradually prunes each layer of the model and tests the accuracy of the model at the same time. Note that SensitivityAnalysis only prunes one layer at a time, and the other layers are kept at their original weights. According to the accuracies of different convolutional layers under different sparsities, we can easily find out which layers the model accuracy is more sensitive to.
Usage
^^^^^
The following code shows the basic usage of SensitivityAnalysis.
.. code-block:: python

    import os

    import torch
    from nni.compression.pytorch.utils.sensitivity_analysis import SensitivityAnalysis

    def val(model):
        model.eval()
        total = 0
        correct = 0
        with torch.no_grad():
            for batchid, (data, label) in enumerate(val_loader):
                data, label = data.cuda(), label.cuda()
                out = model(data)
                _, predicted = out.max(1)
                total += data.size(0)
                correct += predicted.eq(label).sum().item()
        return correct / total

    s_analyzer = SensitivityAnalysis(model=net, val_func=val)
    sensitivity = s_analyzer.analysis(val_args=[net])
    os.makedirs(outdir)
    s_analyzer.export(os.path.join(outdir, filename))
Two key parameters of SensitivityAnalysis are ``model`` and ``val_func``. ``model`` is the neural network to be analyzed and ``val_func`` is the validation function that returns the model accuracy/loss or other metrics on the validation dataset. Because different scenarios may have different ways to calculate the loss/accuracy, users should prepare a function that returns the model accuracy/loss on the dataset and pass it to SensitivityAnalysis.
SensitivityAnalysis can export the sensitivity results as a CSV file; the usage is shown in the example above.
Furthermore, users can specify the sparsity values used to prune each layer via the optional parameter ``sparsities``.
.. code-block:: python

    s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75])
SensitivityAnalysis will then gradually prune 25%, 50%, and 75% of the weights for each layer, and record the model's accuracy at the same time (SensitivityAnalysis only prunes one layer at a time; the other layers are kept at their original weights). If ``sparsities`` is not set, SensitivityAnalysis will use ``numpy.arange(0.1, 1.0, 0.1)`` as the default sparsity values.

Users can also speed up the sensitivity analysis with the ``early_stop_mode`` and ``early_stop_value`` options. By default, SensitivityAnalysis tests the accuracy under all sparsities for each layer. In contrast, when ``early_stop_mode`` and ``early_stop_value`` are set, the sensitivity analysis for a layer stops as soon as the accuracy/loss meets the threshold set by ``early_stop_value``. We support four early stop modes: minimize, maximize, dropped, raised.

* minimize: The analysis stops when the validation metric returned by ``val_func`` is lower than ``early_stop_value``.
* maximize: The analysis stops when the validation metric returned by ``val_func`` is larger than ``early_stop_value``.
* dropped: The analysis stops when the validation metric has dropped by ``early_stop_value``.
* raised: The analysis stops when the validation metric has risen by ``early_stop_value``.
.. code-block:: python

    s_analyzer = SensitivityAnalysis(model=net, val_func=val, sparsities=[0.25, 0.5, 0.75], early_stop_mode='dropped', early_stop_value=0.1)
If users only want to analyze a few specific convolutional layers, they can specify the target conv layers with ``specified_layers`` in the ``analysis`` function. ``specified_layers`` is a list that consists of the PyTorch module names of the conv layers. For example:
.. code-block:: python

    sensitivity = s_analyzer.analysis(val_args=[net], specified_layers=['Conv1'])
In this example, only the ``Conv1`` layer is analyzed. In addition, users can quickly and easily parallelize the analysis by launching multiple processes and assigning different conv layers of the same model to each process.
Output example
^^^^^^^^^^^^^^
The following lines are an example CSV file exported from SensitivityAnalysis. The first line consists of 'layername' and the sparsity list. Here the sparsity value means how much weight SensitivityAnalysis prunes for each layer. Each following line records the model accuracy when the corresponding layer is pruned under different sparsities. Note that, due to the early_stop option, some layers may
not have model accuracies/losses under all sparsities, for example, when the accuracy drop has already exceeded the threshold set by the user.
.. code-block:: bash

    layername,0.05,0.1,0.2,0.3,0.4,0.5,0.7,0.85,0.95
    features.0,0.54566,0.46308,0.06978,0.0374,0.03024,0.01512,0.00866,0.00492,0.00184
    features.3,0.54878,0.51184,0.37978,0.19814,0.07178,0.02114,0.00438,0.00442,0.00142
    features.6,0.55128,0.53566,0.4887,0.4167,0.31178,0.19152,0.08612,0.01258,0.00236
    features.8,0.55696,0.54194,0.48892,0.42986,0.33048,0.2266,0.09566,0.02348,0.0056
    features.10,0.55468,0.5394,0.49576,0.4291,0.3591,0.28138,0.14256,0.05446,0.01578
.. _topology-analysis:
Topology Analysis
-----------------
We also provide several tools for topology analysis during model compression. These tools help users compress their model better. Because of the complex topology of a network, users often need to spend a lot of effort checking whether the compression configuration is reasonable. So we provide these tools for topology analysis to reduce the burden on users.
ChannelDependency
^^^^^^^^^^^^^^^^^
Complicated models may have residual connections/concat operations. When users prune these models, they need to be careful about the channel-count dependencies between the convolutional layers in the model. Take the following residual block in ResNet18 as an example. The output features of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` are added together, so the number of output channels of ``layer2.0.conv2`` and ``layer2.0.downsample.0`` should be the same, or there may be a tensor shape conflict.
.. image:: ../../img/channel_dependency_example.jpg
   :target: ../../img/channel_dependency_example.jpg
   :alt:
If layers that have a channel dependency are assigned different sparsities (here we only discuss structured pruning by L1FilterPruner/L2FilterPruner), there will be a shape conflict between these layers. Even if the pruned model with masks works fine, the pruned model cannot be directly sped up to the final model that runs on devices, because there will be a shape conflict when the model tries to add/concat the outputs of these layers. This tool finds the layers that have channel-count dependencies to help users prune their model better.
Usage
"""""
.. code-block:: python

    from nni.compression.pytorch.utils.shape_dependency import ChannelDependency

    data = torch.ones(1, 3, 224, 224).cuda()
    channel_depen = ChannelDependency(net, data)
    channel_depen.export('dependency.csv')
Output Example
""""""""""""""
The following lines are an output example of torchvision.models.resnet18 exported by ChannelDependency. The layers on the same line have output channel dependencies with each other. For example, layer1.1.conv2, conv1, and layer1.0.conv2 have output channel dependencies with each other, which means the output channel (filter) counts of these three layers should be the same; otherwise, the model may have a shape conflict.
.. code-block:: bash

    Dependency Set,Convolutional Layers
    Set 1,layer1.1.conv2,layer1.0.conv2,conv1
    Set 2,layer1.0.conv1
    Set 3,layer1.1.conv1
    Set 4,layer2.0.conv1
    Set 5,layer2.1.conv2,layer2.0.conv2,layer2.0.downsample.0
    Set 6,layer2.1.conv1
    Set 7,layer3.0.conv1
    Set 8,layer3.0.downsample.0,layer3.1.conv2,layer3.0.conv2
    Set 9,layer3.1.conv1
    Set 10,layer4.0.conv1
    Set 11,layer4.0.downsample.0,layer4.1.conv2,layer4.0.conv2
    Set 12,layer4.1.conv1
MaskConflict
^^^^^^^^^^^^
When the masks of different layers in a model have a conflict (for example, assigning different sparsities to layers that have a channel dependency), we can fix the mask conflict with MaskConflict. Specifically, MaskConflict loads the masks exported by the pruners (L1FilterPruner, etc.), checks if there is a mask conflict, and if so, sets the conflicting masks to the same value.
.. code-block:: python

    from nni.compression.pytorch.utils.mask_conflict import fix_mask_conflict

    fixed_mask = fix_mask_conflict('./resnet18_mask', net, data)
not_safe_to_prune
^^^^^^^^^^^^^^^^^
If we try to prune a layer whose output tensor is taken as input by a shape-constraint OP (for example, view, reshape), then such pruning may not be safe. For example, suppose we have a convolutional layer followed by a view function.
.. code-block:: python

    x = self.conv(x)  # output shape is (batch, 1024, 3, 3)
    x = x.view(-1, 1024)
If the output shape of the pruned conv layer is not divisible by 1024 (for example, (batch, 500, 3, 3)), we may encounter a shape error. We cannot replace such a function that directly operates on the tensor. Therefore, we need to be careful when pruning such layers. The function ``not_safe_to_prune`` finds all the layers followed by a shape-constraint function. Here is a usage example. If you meet a shape error when running forward inference on the sped-up model, you can exclude the layers returned by ``not_safe_to_prune`` and try again.
.. code-block:: python

    not_safe = not_safe_to_prune(model, dummy_input)
.. _flops-counter:
Model FLOPs/Parameters Counter
------------------------------
We provide a model counter for calculating the model FLOPs and parameters. This counter supports calculating the FLOPs/parameters of a normal model without masks, and it can also calculate the FLOPs/parameters of a model with mask wrappers, which helps users easily check model complexity during model compression in NNI. Note that, for structured pruning, we only identify the remaining filters according to their masks, without taking the pruned input channels into consideration, so the calculated FLOPs will be larger than the real number (i.e., the number calculated after model speedup).
We support two modes for collecting information about modules. The first mode is ``default``, which only collects the information of convolution and linear layers. The second mode is ``full``, which also collects the information of other operations. Users can easily use the collected ``results`` for further analysis.
Usage
^^^^^
.. code-block:: python

    from nni.compression.pytorch.utils import count_flops_params

    # Given input size (1, 1, 28, 28)
    flops, params, results = count_flops_params(model, (1, 1, 28, 28))

    # Given input tensor with size (1, 1, 28, 28) and switch to full mode
    x = torch.randn(1, 1, 28, 28)
    flops, params, results = count_flops_params(model, (x,), mode='full')  # tuple of tensor as input

    # Format output size to M (i.e., 10^6)
    print(f'FLOPs: {flops/1e6:.3f}M, Params: {params/1e6:.3f}M')

    print(results)
    # example content of `results`:
    {
        'conv': {'flops': [60], 'params': [20], 'weight_size': [(5, 3, 1, 1)], 'input_size': [(1, 3, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']},
        'conv2': {'flops': [100], 'params': [30], 'weight_size': [(5, 5, 1, 1)], 'input_size': [(1, 5, 2, 2)], 'output_size': [(1, 5, 2, 2)], 'module_type': ['Conv2d']}
    }
Overview of NNI Model Compression
=================================
Deep neural networks (DNNs) have achieved great success in many tasks like computer vision, natural language processing, and speech processing.
However, typical neural networks are both computationally expensive and energy-intensive,
which makes them difficult to deploy on devices with low computation resources or with strict latency requirements.
Therefore, a natural thought is to perform model compression to reduce model size and accelerate model training/inference without losing performance significantly.
Model compression techniques can be divided into two categories: pruning and quantization.
Pruning methods explore the redundancy in the model weights and try to remove/prune the redundant and uncritical weights.
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations.
We further elaborate on the two methods, pruning and quantization, in the following chapters. Besides, the figure below visualizes the difference between these two methods.
.. image:: ../../img/prune_quant.jpg
   :target: ../../img/prune_quant.jpg
   :scale: 40%
   :align: center
   :alt:
NNI provides an easy-to-use toolkit to help users design and use model pruning and quantization algorithms.
To compress their models, users only need to add several lines to their code.
Some popular model compression algorithms are built into NNI.
On the other hand, users can easily customize new compression algorithms using NNI's interface.
There are several core features supported by NNI model compression:
* Support many popular pruning and quantization algorithms.
* Automate model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
* Speedup a compressed model to make it have lower inference latency and also make it smaller.
* Provide friendly and easy-to-use compression utilities for users to dive into the compression process and results.
* Concise interface for users to customize their own compression algorithms.
Compression Pipeline
--------------------
.. image:: ../../img/compression_pipeline.png
   :target: ../../img/compression_pipeline.png
   :alt:
   :align: center
   :scale: 30%
The overall compression pipeline in NNI is shown above. For compressing a pretrained model, pruning and quantization can be used alone or in combination.
If users want to apply both, a sequential mode is recommended as common practice.
.. note::
    Note that NNI pruners and quantizers are not meant to physically compact the model but to simulate the compression effect. In contrast, the NNI speedup tool can truly compress the model by changing the network architecture and therefore reduce latency.
    To obtain a truly compact model, users should conduct :doc:`pruning speedup <../tutorials/pruning_speedup>` or :doc:`quantization speedup <../tutorials/quantization_speedup>`.
The interface and APIs are unified for both PyTorch and TensorFlow. Currently only the PyTorch version is supported; the TensorFlow version will be supported in the future.
Model Speedup
-------------
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
For example, pruning algorithms use masks, and quantization algorithms still store quantized values in float32.
Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model.
The following figure shows how NNI prunes and speeds up your models.
.. image:: ../../img/nni_prune_process.png
   :target: ../../img/nni_prune_process.png
   :scale: 30%
   :align: center
   :alt:
The detailed tutorial of Speedup Model with Mask can be found :doc:`here <../tutorials/pruning_speedup>`.
The detailed tutorial of Speedup Model with Calibration Config can be found :doc:`here <../tutorials/quantization_speedup>`.
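
As a quick illustration, pruning speedup is applied roughly as follows. This is a minimal sketch that assumes you already have the masked model and the masks produced by a pruner; the full workflow is in the tutorial linked above.

.. code-block:: python

    import torch
    from nni.compression.pytorch import ModelSpeedup

    # `model` is the masked model and `masks` are the masks produced by a pruner (assumed to exist)
    dummy_input = torch.rand(1, 3, 224, 224)
    ModelSpeedup(model, dummy_input, masks).speedup_model()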
.. attention::
    NNI's model pruning framework has been upgraded to a more powerful version (named pruning v2 before nni v2.6).
    The old version (`named pruning before nni v2.6 <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) will be out of maintenance. If for some reason you have to use the old pruning,
    v2.6 is the last nni version to support the old pruning version.
Pruner in NNI
=============
NNI implements the main part of each pruning algorithm as a pruner. All pruners are implemented as closely as possible to what is described in the paper (if one exists).
The following table provides a brief introduction to the pruners implemented in NNI; click the links in the table to view a more detailed introduction and use cases.
There are two kinds of pruners in NNI, please refer to :ref:`basic pruner <basic-pruner>` and :ref:`scheduled pruner <scheduled-pruner>` for details.
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Name
     - Brief Introduction of Algorithm
   * - :ref:`level-pruner`
     - Prunes the specified ratio of each weight element based on the absolute value of the weight element
   * - :ref:`l1-norm-pruner`
     - Prunes output channels with the smallest L1 norm of weights (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
   * - :ref:`l2-norm-pruner`
     - Prunes output channels with the smallest L2 norm of weights
   * - :ref:`fpgm-pruner`
     - Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration `Reference Paper <https://arxiv.org/abs/1811.00250>`__
   * - :ref:`slim-pruner`
     - Prunes output channels by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
   * - :ref:`activation-apoz-rank-pruner`
     - Prunes output channels based on the APoZ metric (average percentage of zeros), which measures the percentage of zeros in the activations of (convolutional) layers. `Reference Paper <https://arxiv.org/abs/1607.03250>`__
   * - :ref:`activation-mean-rank-pruner`
     - Prunes output channels based on the metric that calculates the smallest mean value of output activations
   * - :ref:`taylor-fo-weight-pruner`
     - Prunes filters based on the first-order Taylor expansion on weights (Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
   * - :ref:`admm-pruner`
     - Pruning based on the ADMM optimization technique `Reference Paper <https://arxiv.org/abs/1804.03294>`__
   * - :ref:`linear-pruner`
     - The sparsity ratio increases linearly across pruning rounds; in each round, a basic pruner is used to prune the model.
   * - :ref:`agp-pruner`
     - Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
   * - :ref:`lottery-ticket-pruner`
     - The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper <https://arxiv.org/abs/1803.03635>`__
   * - :ref:`simulated-annealing-pruner`
     - Automatic pruning with a guided heuristic search method, the Simulated Annealing algorithm `Reference Paper <https://arxiv.org/abs/1907.03141>`__
   * - :ref:`auto-compress-pruner`
     - Automatic pruning by iteratively calling SimulatedAnnealing Pruner and ADMM Pruner `Reference Paper <https://arxiv.org/abs/1907.03141>`__
   * - :ref:`amc-pruner`
     - AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/abs/1802.03494>`__
   * - :ref:`movement-pruner`
     - Movement Pruning: Adaptive Sparsity by Fine-Tuning `Reference Paper <https://arxiv.org/abs/2005.07683>`__
Overview of NNI Model Pruning
=============================
Pruning is a common technique to compress neural network models.
Pruning methods explore the redundancy in the model weights (parameters) and try to remove/prune the redundant and uncritical weights.
The redundant elements are pruned from the model: their values are zeroed, and we make sure they do not take part in the back-propagation process.
The following concepts can help you understand pruning in NNI.
Pruning Target
--------------
The pruning target is where we apply the sparsity.
Most pruning methods prune the weights to reduce the model size and accelerate inference.
Other pruning methods also apply sparsity to activations (e.g., inputs, outputs, or feature maps) to accelerate inference.
NNI supports pruning module weights right now, and will support other pruning targets in the future.
.. _basic-pruner:
Basic Pruner
------------
A basic pruner generates the masks for each pruning target (weights) for a given sparsity ratio.
It usually takes a model and a config list as input arguments, then generates masks for each pruning target, as shown in the sketch below.
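
For instance, a minimal sketch of applying a basic pruner (the config is just an example; ``model`` is assumed to be an existing PyTorch model):

.. code-block:: python

    from nni.compression.pytorch.pruning import L1NormPruner

    config_list = [{'op_types': ['Conv2d'], 'total_sparsity': 0.5}]
    pruner = L1NormPruner(model, config_list)
    # generate the masks for each pruning target
    masked_model, masks = pruner.compress()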
.. _scheduled-pruner:
Scheduled Pruner
----------------
A scheduled pruner decides how to allocate the sparsity ratio to each pruning target;
it also handles the model speedup (after each pruning iteration) and the finetuning logic.
In terms of implementation, a scheduled pruner is a combination of a pruning scheduler, a basic pruner, and a task generator.
The task generator only cares about the pruning effect that should be achieved in each round, and uses a config list to express how to prune.
The basic pruner is reset with the model and config list given by the task generator, and then generates the masks.
For a clearer picture of the structure, please refer to the figure below.
.. image:: ../../img/pruning_process.png
   :target: ../../img/pruning_process.png
   :scale: 30%
   :align: center
   :alt:
For more information about the scheduled pruning process, please refer to :doc:`Pruning Scheduler <pruning_scheduler>`.
Granularity
-----------
Fine-grained pruning or unstructured pruning refers to pruning each individual weight separately.
Coarse-grained pruning or structured pruning prunes a regular group of weights, such as a convolutional filter.
Only :ref:`level-pruner` and :ref:`admm-pruner` support fine-grained pruning; all other pruners do some kind of structured pruning on weights.
.. _dependency-aware-mode-for-output-channel-pruning:
Dependency-aware Mode for Output Channel Pruning
------------------------------------------------
Currently, we support dependency-aware mode in several pruners: :ref:`l1-norm-pruner`, :ref:`l2-norm-pruner`, :ref:`fpgm-pruner`,
:ref:`activation-apoz-rank-pruner`, :ref:`activation-mean-rank-pruner`, :ref:`taylor-fo-weight-pruner`.
In these pruning algorithms, the pruner prunes each layer separately. While pruning a layer,
the algorithm quantifies the importance of each filter based on some specific metric (such as the L1 norm), and prunes the less important output channels.
We use pruning convolutional layers as an example to explain dependency-aware mode.
As the :ref:`topology analysis utils <topology-analysis>` show, if the output channels of two convolutional layers (conv1, conv2) are added together,
then these two convolutional layers have a channel dependency with each other (for more details, please see :ref:`ChannelDependency <topology-analysis>`).
Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
   :target: ../../img/mask_conflict.jpg
   :scale: 80%
   :align: center
   :alt:
Suppose we prune the first 50% of the output channels (filters) of conv1 and the last 50% of the output channels of conv2.
Although both layers have 50% of their filters pruned, the speedup module still needs to add zeros to align the output channels.
In this case, we cannot harvest the speed benefit from the model pruning.
To better gain the speed benefit of model pruning, we add a dependency-aware mode to the pruners that can prune output channels.
In the dependency-aware mode, the pruner prunes the model based not only on the metric of each output channel, but also on the topology of the whole network architecture.
In the dependency-aware mode (``dependency_aware`` is set to ``True``), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
   :target: ../../img/dependency-aware.jpg
   :scale: 80%
   :align: center
   :alt:
Take the dependency-aware mode of :ref:`l1-norm-pruner` as an example.
Specifically, the pruner calculates the sum of the L1 norms across all the layers in the dependency set for each channel.
Obviously, the number of channels that can actually be pruned in this dependency set is determined in the end by the minimum sparsity of the layers in this dependency set (denoted by ``min_sparsity``).
According to the L1 norm sum of each channel, the pruner prunes the same ``min_sparsity`` proportion of channels for all the layers.
Next, the pruner additionally prunes ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on its own L1 norm of each channel.
For example, suppose the output channels of ``conv1``, ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3, 0.2 respectively.
In this case, the ``dependency-aware pruner`` will
* First, prune the same 20% of channels for `conv1` and `conv2` according to the L1 norm sum of `conv1` and `conv2`.
* Second, additionally prune 10% of the channels for `conv1` according to the L1 norm of each channel of `conv1`.
In addition, for convolutional layers that have more than one filter group,
the ``dependency-aware pruner`` will also try to prune the same number of channels in each filter group.
Overall, this pruner prunes the model according to the L1 norm of each filter and tries to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
.. Note:: Operations that will be recognized as having channel dependencies: add/sub/mul/div, addcmul/addcdiv, logical_and/or/xor
In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
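
A minimal sketch of enabling the dependency-aware mode (the config and the dummy input shape are example values; ``dummy_input`` is required so that the pruner can trace the model and find the channel dependencies):

.. code-block:: python

    import torch
    from nni.compression.pytorch.pruning import L1NormPruner

    config_list = [{'op_types': ['Conv2d'], 'total_sparsity': 0.5}]
    pruner = L1NormPruner(model, config_list, mode='dependency_aware',
                          dummy_input=torch.rand(1, 3, 224, 224))
    masked_model, masks = pruner.compress()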
Pruning Scheduler
=================
The pruning scheduler is a new feature supported in pruning v2. It brings more flexibility for pruning the model iteratively.
All the built-in iterative pruners (e.g., AGPPruner, SimulatedAnnealingPruner) are based on three abstracted components: pruning scheduler, pruners and task generators.
In addition to using the NNI built-in iterative pruners,
users can directly use the pruning schedulers to customize their own iterative pruning logic.
Workflow of Pruning Scheduler
-----------------------------
In iterative pruning, the final goal is broken down into smaller goals, and one small goal is completed in each iteration.
For example, each iteration increases the sparsity ratio a little, and after several pruning iterations the continuously pruned model reaches the final overall sparsity;
or the overall sparsity is fixed, different ways of allocating sparsity between layers are tried in each iteration, and the best allocation is kept.
We define a small goal as a ``Task``; it usually includes states inherited from previous iterations (e.g., the pruned model and masks) and a description of the current goal (e.g., a config list that describes how to allocate sparsity).
Details about ``Task`` can be found in this :githublink:`file <nni/algorithms/compression/v2/pytorch/base/scheduler.py>`.
The pruning scheduler handles two main components, a basic pruner and a task generator. The logic of generating a ``Task`` is encapsulated in the task generator.
In each iteration (one pruning step), the pruning scheduler parses the ``Task`` obtained from the task generator,
and resets the pruner with the ``model``, ``masks``, and ``config_list`` parsed from the ``Task``.
Then the pruning scheduler generates the new masks with the pruner. During an iteration, the newly masked model may also go through speedup, finetuning, and evaluation.
After one iteration is done, the pruning scheduler collects the compact model, new masks, and evaluation score, packages them into a ``TaskResult``, and passes it to the task generator.
The iterative process ends when the task generator has no more ``Task``.
How to Customize Iterative Pruning
----------------------------------
We use AGP pruning as an example to explain how to implement iterative pruning with the scheduler in NNI.
.. code-block:: python

    import torch

    from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner, PruningScheduler
    from nni.algorithms.compression.v2.pytorch.pruning.tools import AGPTaskGenerator

    # `model`, `config_list`, `finetuner`, `dummy_input` and `device` are assumed to be defined beforehand
    pruner = L1NormPruner(model=None, config_list=None, mode='dependency_aware', dummy_input=torch.rand(10, 3, 224, 224).to(device))
    task_generator = AGPTaskGenerator(total_iteration=10, origin_model=model, origin_config_list=config_list, log_dir='.', keep_intermediate_result=True)
    scheduler = PruningScheduler(pruner, task_generator, finetuner=finetuner, speedup=True, dummy_input=dummy_input, evaluator=None, reset_weight=False)

    scheduler.compress()
    _, model, masks, _, _ = scheduler.get_best_result()
The full script can be found :githublink:`here <examples/model_compress/pruning/scheduler_torch.py>`.
In this example, we use the L1 Norm Pruner in dependency-aware mode as the basic pruner during each iteration.
Note that we do not need to pass ``model`` and ``config_list`` to the pruner, because in each iteration the ``model`` and ``config_list`` used by the pruner are received from the task generator.
Then we can use ``scheduler`` directly as an iterative pruner. In fact, this is the implementation of ``AGPPruner`` in NNI.
More about Task Generator
-------------------------
The task generator provides the model that needs to be pruned in each iteration and the corresponding config_list.
For example, ``AGPTaskGenerator`` will provide the model pruned in the previous iteration and compute the sparsity to use in the current iteration.
The ``TaskGenerator`` puts all this pruning information into a ``Task``, and the pruning scheduler gets the ``Task`` and runs it.
The pruning result is returned to the ``TaskGenerator`` at the end of each iteration, and the ``TaskGenerator`` decides whether and how to generate the next ``Task``.
The information included in the ``Task`` and ``TaskResult`` can be found :githublink:`here <nni/algorithms/compression/v2/pytorch/base/scheduler.py>`.
A clearer iterative pruning flow chart can be found :doc:`here <pruning>`.
If you want to implement your own task generator, please follow the ``TaskGenerator`` :githublink:`interface <nni/algorithms/compression/v2/pytorch/pruning/tools/base.py>`.
Two main functions should be implemented: ``init_pending_tasks(self) -> List[Task]`` and ``generate_tasks(self, task_result: TaskResult) -> List[Task]``.
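
A skeleton of a custom task generator might look like the sketch below; the method names follow the interface mentioned above, while the import paths and the internal logic are only illustrative:

.. code-block:: python

    from typing import List

    # illustrative import paths; check the interface file linked above for the real locations
    from nni.algorithms.compression.v2.pytorch.base.scheduler import Task, TaskResult
    from nni.algorithms.compression.v2.pytorch.pruning.tools.base import TaskGenerator


    class MyTaskGenerator(TaskGenerator):
        def init_pending_tasks(self) -> List[Task]:
            # create the Task(s) for the first iteration, e.g. from the origin model and config list
            ...

        def generate_tasks(self, task_result: TaskResult) -> List[Task]:
            # inspect the result of the previous iteration and decide whether to generate more Tasks;
            # returning an empty list ends the iterative pruning process
            ...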
Why Use Pruning Scheduler
-------------------------
One of the benefits of using a scheduler for iterative pruning is that users get access to more functions of the NNI pruning components;
for the sake of interface simplicity and faithfulness to the original papers, NNI does not fully expose all the low-level interfaces to the upper layer.
For example, resetting the weights to those of the original model in each iteration is a key point of the lottery ticket pruning algorithm, and this is implemented in ``LotteryTicketPruner``.
To reduce the complexity of the interface, we only support this function in ``LotteryTicketPruner``, not in other pruners.
If users want to reset the weights during each iteration of AGP pruning, ``AGPPruner`` cannot do this, but users can easily set ``reset_weight=True`` in ``PruningScheduler`` to achieve it.
What's more, for a customized pruner or task generator, using the scheduler can easily enhance the algorithm.
In addition, users can also customize the scheduling process to implement their own scheduler.
Overview of NNI Model Quantization
==================================
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations,
which can reduce the computations and the inference time. In the context of deep neural networks, the major numerical
format for model weights is 32-bit float, or FP32. Many research works have demonstrated that weights and activations
can be represented using 8-bit integers without significant loss in accuracy. Even lower bit-widths, such as 4/2/1 bits,
are an active field of research.
A quantizer is a quantization algorithm implementation in NNI.
You can also :doc:`create your own quantizer <../tutorials/quantization_customize>` using the NNI model compression interface.
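
As a rough sketch, applying a built-in quantizer follows the same pattern as applying a pruner. The example below uses ``QAT Quantizer`` with an 8-bit config; the import path and arguments follow the NNI 2.x quantization interface, but please check the quantizer's documentation for the exact signature:

.. code-block:: python

    import torch
    from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

    config_list = [{
        'op_types': ['Conv2d'],
        'quant_types': ['input', 'weight'],
        'quant_bits': {'input': 8, 'weight': 8}
    }]
    # the optimizer used during quantization-aware training and a dummy input for tracing (example values)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    dummy_input = torch.rand(1, 3, 224, 224)
    quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
    quantizer.compress()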
Quantizer in NNI
================
NNI implements the main part of each quantization algorithm as a quantizer. All quantizers are implemented as closely as possible to what is described in the paper (if one exists).
The following table provides a brief introduction to the quantizers implemented in NNI; click the links in the table to view a more detailed introduction and use cases.
.. list-table::
   :header-rows: 1
   :widths: auto

   * - Name
     - Brief Introduction of Algorithm
   * - :ref:`naive-quantizer`
     - Quantizes weights to 8 bits by default
   * - :ref:`qat-quantizer`
     - Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
   * - :ref:`dorefa-quantizer`
     - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. `Reference Paper <https://arxiv.org/abs/1606.06160>`__
   * - :ref:`bnn-quantizer`
     - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
   * - :ref:`lsq-quantizer`
     - Learned step size quantization. `Reference Paper <https://arxiv.org/pdf/1902.08153.pdf>`__
   * - :ref:`observer-quantizer`
     - Post-training quantization. Collects quantization information during calibration with observers.
Compression
===========
.. toctree::
   :hidden:
   :maxdepth: 2

   Overview <overview>
   Pruning <toctree_pruning>
   Quantization <toctree_quantization>
   Config Specification <compression_config_list>
   Evaluator <compression_evaluator>
   Advanced Usage <advanced_usage>
Pruning
=======
.. toctree::
   :hidden:
   :maxdepth: 2

   Overview <pruning>
   Quickstart </tutorials/pruning_quick_start_mnist>
   Pruner <pruner>
   Speedup </tutorials/pruning_speedup>
   Best Practices <best_practices>
Quantization
============
.. toctree::
   :hidden:
   :maxdepth: 2

   Overview <quantization>
   Quickstart </tutorials/quantization_quick_start_mnist>
   Quantizer <quantizer>
   SpeedUp </tutorials/quantization_speedup>
# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import re
import subprocess
import sys
sys.path.insert(0, os.path.abspath('../..'))
sys.path.insert(0, os.path.abspath('../extension'))
# -- Project information ---------------------------------------------------
from datetime import datetime
project = 'NNI'
copyright = f'{datetime.now().year}, Microsoft'
author = 'Microsoft'
# The short X.Y version
version = ''
# The full version, including alpha/beta/rc tags
# FIXME: this should be written somewhere globally
release = 'v2.9'
# -- General configuration ---------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx_gallery.gen_gallery',
    'sphinx.ext.autodoc',
    'sphinx.ext.autosummary',
    'sphinx.ext.intersphinx',
    'sphinx.ext.mathjax',
    'sphinxarg4nni.ext',
    'sphinx.ext.napoleon',
    'sphinx.ext.viewcode',
    'sphinx.ext.intersphinx',
    'sphinxcontrib.bibtex',
    'sphinxcontrib.youtube',
    # 'nbsphinx',  # nbsphinx has conflicts with sphinx-gallery.
    'sphinx.ext.extlinks',
    'IPython.sphinxext.ipython_console_highlighting',
    'sphinx_tabs.tabs',
    'sphinx_copybutton',

    # Custom extensions in extension/ folder.
    'tutorial_links',  # this has to be after sphinx-gallery
    'getpartialtext',
    'inplace_translation',
    'cardlinkitem',
    'codesnippetcard',
    'patch_autodoc',
    'toctree_check',
]
# Autosummary related settings
autosummary_imported_members = True
autosummary_ignore_module_all = False
# Auto-generate stub files before building docs
autosummary_generate = True
# Add mock modules
autodoc_mock_imports = [
'apex', 'nni_node', 'tensorrt', 'pycuda', 'nn_meter', 'azureml',
'ConfigSpace', 'ConfigSpaceNNI', 'smac', 'statsmodels', 'pybnn',
]
# Some of our modules cannot generate summary
autosummary_mock_imports = [
'nni.retiarii.codegen.tensorflow',
'nni.nas.benchmarks.nasbench101.db_gen',
'nni.tools.jupyter_extension.management',
] + autodoc_mock_imports
autodoc_typehints = 'description'
autodoc_typehints_description_target = 'documented'
autodoc_inherit_docstrings = False
# Sphinx will warn about all references where the target cannot be found.
nitpicky = False # disabled for now
# A list of regular expressions that match URIs that should not be checked.
linkcheck_ignore = [
r'http://localhost:\d+',
r'.*://.*/#/', # Modern websites that have URLs like xxx.com/#/guide
r'https://github\.com/JSong-Jia/Pic/', # Community links can't be found any more
# Some URLs that often fail
r'https://www\.cs\.toronto\.edu/', # CIFAR-10
r'https://help\.aliyun\.com/document_detail/\d+\.html', # Aliyun
r'http://www\.image-net\.org/', # ImageNet
r'https://www\.msra\.cn/', # MSRA
r'https://1drv\.ms/', # OneDrive (shortcut)
r'https://onedrive\.live\.com/', # OneDrive
r'https://www\.openml\.org/', # OpenML
r'https://ml\.informatik\.uni-freiburg\.de/',
r'https://docs\.nvidia\.com/deeplearning/',
]
# Ignore all links located in release.rst
linkcheck_exclude_documents = ['^release']
# Bibliography files
bibtex_bibfiles = ['refs.bib']
# Add a heading to bibliography
bibtex_footbibliography_header = '.. rubric:: Bibliography'
# Set bibliography style
bibtex_default_style = 'plain'
# Sphinx gallery examples
sphinx_gallery_conf = {
'examples_dirs': '../../examples/tutorials', # path to your example scripts
'gallery_dirs': 'tutorials', # path to where to save gallery generated output
# Control ignored python files.
'ignore_pattern': r'__init__\.py|/scripts/',
# This is `/plot` by default. Only files starting with `/plot` will be executed.
# All files should be executed in our case.
'filename_pattern': r'.*',
# Disabling download button of all scripts
'download_all_examples': False,
# Change default thumbnail
# Working directory is strange, needs full path.
'default_thumb_file': os.path.join(os.path.dirname(__file__), '../img/thumbnails/nni_icon_blue.png'),
}
# Copybutton: strip and configure input prompts for code cells.
copybutton_prompt_text = r">>> |\.\.\. |\$ |In \[\d*\]: | {2,5}\.\.\.: | {5,8}: "
copybutton_prompt_is_regexp = True
# Copybutton: customize selector to exclude gallery outputs.
copybutton_selector = ":not(div.sphx-glr-script-out) > div.highlight pre"
# Allow additional builders to be considered compatible.
sphinx_tabs_valid_builders = ['linkcheck']
# Disallow the sphinx tabs css from loading.
sphinx_tabs_disable_css_loading = True
# Some tutorials might need to appear more than once in toc.
# In this list, we make source/target tutorial pairs.
# Each "source" tutorial rst will be copied to "target" tutorials.
# The anchors will be replaced to avoid duplicate labels.
# Target should start with ``cp_`` to be properly ignored in git.
tutorials_copy_list = [
# Seems that we don't need it for now.
# Add tuples back if we need it in future.
]
# Toctree ensures that toctree docs do not contain any other contents.
# Home page should be an exception.
toctree_check_whitelist = [
'index',
# FIXME: Other exceptions should be correctly handled.
'compression/index',
'compression/pruning',
'compression/quantization',
'hpo/hpo_benchmark',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['../templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
source_suffix = ['.rst']
# The master toctree document.
master_doc = 'index'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = 'en'
# Translation related settings
locale_dir = ['locales']
# Documents that requires translation: https://github.com/microsoft/nni/issues/4298
gettext_documents = [
r'^index$',
r'^quickstart$',
r'^installation$',
r'^(nas|hpo|compression)/overview$',
r'^tutorials/(hello_nas|pruning_quick_start_mnist|hpo_quickstart_pytorch/main)$',
]
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = [
'_build',
'Thumbs.db',
'.DS_Store',
'**.ipynb_checkpoints',
# Exclude translations. They will be added back via replacement later if language is set.
'**_zh.rst',
# Exclude generated tutorials index
'tutorials/index.rst',
]
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = None
# -- Options for HTML output -------------------------------------------------
# HTML logo
html_logo = '../img/nni_icon.svg'
# HTML favicon
html_favicon = '../img/favicon.ico'
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_material'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
html_theme_options = {
# Set the name of the project to appear in the navigation.
'nav_title': 'Neural Network Intelligence',
# Set your GA account ID to enable tracking
'google_analytics_account': 'UA-136029994-1',
# Specify a base_url used to generate sitemap.xml. If not
# specified, then no sitemap will be built.
'base_url': 'https://nni.readthedocs.io/',
# Set the color and the accent color
# Remember to update static/css/material_custom.css when this is updated.
# Set those colors in layout.html.
'color_primary': 'custom',
'color_accent': 'custom',
# Set the repo location to get a badge with stats
'repo_url': 'https://github.com/microsoft/nni/',
'repo_name': 'GitHub',
# Visible levels of the global TOC; -1 means unlimited
'globaltoc_depth': 5,
# Expand all toc so that they can be dynamically collapsed
'globaltoc_collapse': False,
'version_dropdown': True,
# This is a placeholder, which should be replaced later.
'version_info': {
'current': '/'
},
# Text to appear at the top of the home page in a "hero" div.
'heroes': {
'index': 'An open source AutoML toolkit for hyperparameter optimization, neural architecture search, '
'model compression and feature engineering.'
}
}
# Disable show source link.
html_show_sourcelink = False
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['../static']
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
html_sidebars = {
"**": ["logo-text.html", "globaltoc.html", "localtoc.html", "searchbox.html"]
}
html_title = 'Neural Network Intelligence'
# Add extra css files and js files
html_css_files = [
'css/material_theme.css',
'css/material_custom.css',
'css/material_dropdown.css',
'css/sphinx_gallery.css',
'css/index_page.css',
]
html_js_files = [
'js/version.js',
'js/github.js',
'js/sphinx_gallery.js',
'js/misc.js'
]
# HTML context that can be used in jinja templates
git_commit_id = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
html_context = {
'git_commit_id': git_commit_id
}
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'NeuralNetworkIntelligencedoc'
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'NeuralNetworkIntelligence.tex', 'Neural Network Intelligence Documentation',
'Microsoft', 'manual'),
]
# -- Options for manual page output ------------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'neuralnetworkintelligence', 'Neural Network Intelligence Documentation',
[author], 1)
]
# -- Options for Texinfo output ----------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'NeuralNetworkIntelligence', 'Neural Network Intelligence Documentation',
author, 'NeuralNetworkIntelligence', 'One line description of project.',
'Miscellaneous'),
]
# -- Options for Epub output -------------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# external links (for github code)
# Reference the code via :githublink:`path/to/your/example/code.py`
extlinks = {
'githublink': ('https://github.com/microsoft/nni/blob/' + git_commit_id + '/%s', 'Github link: %s')
}
:orphan:
One-shot Strategy (legacy)
==========================
.. warning:: This page will be removed in future releases.
.. _darts-strategy:
DARTS
-----
The paper `DARTS: Differentiable Architecture Search <https://arxiv.org/abs/1806.09055>`__ addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Their method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent.
The authors' code optimizes the network weights and the architecture weights alternately in mini-batches. They further explore using second-order optimization (unroll) instead of first-order optimization to improve performance.
The implementation on NNI is based on the `official implementation <https://github.com/quark0/darts>`__ and a `popular 3rd-party repo <https://github.com/khanrc/pt.darts>`__. DARTS on NNI is designed to be general for arbitrary search spaces. A CNN search space tailored for CIFAR10, the same as in the original paper, is implemented as a use case of DARTS.
.. autoclass:: nni.retiarii.oneshot.pytorch.DartsTrainer
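For reference, a minimal usage sketch of ``DartsTrainer`` is shown below. The toy model, dataset, and hyper-parameters are assumptions for illustration only, and the exact keyword arguments may vary between NNI versions; please treat the :githublink:`example code <examples/nas/oneshot/darts>` as the authoritative version.

.. code-block:: python

    import torch
    import torch.nn as nn
    from torchvision import datasets, transforms
    from nni.retiarii.nn.pytorch import LayerChoice
    from nni.retiarii.oneshot.pytorch import DartsTrainer

    class Net(nn.Module):
        """A toy search space with a single layer choice."""
        def __init__(self):
            super().__init__()
            self.conv = LayerChoice([nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(3, 16, 5, padding=2)])
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(16, 10)

        def forward(self, x):
            return self.fc(self.pool(self.conv(x)).flatten(1))

    model = Net()
    trainer = DartsTrainer(
        model=model,
        loss=nn.CrossEntropyLoss(),
        metrics=lambda output, target: {'acc': (output.argmax(1) == target).float().mean().item()},
        optimizer=torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9, weight_decay=3e-4),
        num_epochs=50,
        dataset=datasets.CIFAR10('./data', train=True, download=True, transform=transforms.ToTensor()),
        batch_size=64,
        unrolled=False,  # set True to use second-order (unrolled) optimization
    )
    trainer.fit()
    print(trainer.export())  # the chosen candidate for each mutable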
Reproduction Results
^^^^^^^^^^^^^^^^^^^^
The above-mentioned example is meant to reproduce the results in the paper; we run experiments with both first-order and second-order optimization. Due to time limits, we retrain *only the best architecture* derived from the search phase and repeat the experiment *only once*. Our results are currently on par with the results reported in the paper. We will add more results later when they are ready.
.. list-table::
:header-rows: 1
:widths: auto
* -
- In paper
- Reproduction
* - First order (CIFAR10)
- 3.00 +/- 0.14
- 2.78
* - Second order (CIFAR10)
- 2.76 +/- 0.09
- 2.80
Examples
^^^^^^^^
:githublink:`Example code <examples/nas/oneshot/darts>`
.. code-block:: bash
# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
git clone https://github.com/Microsoft/nni.git
# search the best architecture
cd examples/nas/oneshot/darts
python3 search.py
# train the best architecture
python3 retrain.py --arc-checkpoint ./checkpoints/epoch_49.json
Limitations
^^^^^^^^^^^
* DARTS doesn't support DataParallel and needs to be customized in order to support DistributedDataParallel.
.. _enas-strategy:
ENAS
----
The paper `Efficient Neural Architecture Search via Parameter Sharing <https://arxiv.org/abs/1802.03268>`__ uses parameter sharing between child models to accelerate the NAS process. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile, the model corresponding to the selected subgraph is trained to minimize a canonical cross-entropy loss.
The implementation on NNI is based on the `official implementation in Tensorflow <https://github.com/melodyguan/enas>`__, including a general-purpose reinforcement-learning controller and a trainer that trains the target network and this controller alternately. Following the paper, we have also implemented the macro and micro search spaces on CIFAR10 to demonstrate how to use these trainers. Since the code to train from scratch on NNI is not ready yet, reproduction results are currently unavailable.
.. autoclass:: nni.retiarii.oneshot.pytorch.EnasTrainer
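A minimal usage sketch of ``EnasTrainer`` might look like the following; as with the DARTS sketch above, the toy model and hyper-parameters are illustrative assumptions and the :githublink:`example code <examples/nas/oneshot/enas>` remains the authoritative reference. The main difference from DARTS is the ``reward_function``, which feeds validation accuracy back to the RL controller.

.. code-block:: python

    import torch
    import torch.nn as nn
    from torchvision import datasets, transforms
    from nni.retiarii.nn.pytorch import LayerChoice
    from nni.retiarii.oneshot.pytorch import EnasTrainer

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = LayerChoice([nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(3, 16, 5, padding=2)])
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(16, 10)

        def forward(self, x):
            return self.fc(self.pool(self.conv(x)).flatten(1))

    def top1_accuracy(output, target):
        return (output.argmax(1) == target).float().mean().item()

    model = Net()
    trainer = EnasTrainer(
        model,
        loss=nn.CrossEntropyLoss(),
        metrics=lambda output, target: {'acc': top1_accuracy(output, target)},
        reward_function=top1_accuracy,  # reward signal for the RL controller
        optimizer=torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9),
        num_epochs=10,
        dataset=datasets.CIFAR10('./data', train=True, download=True, transform=transforms.ToTensor()),
        batch_size=64,
    )
    trainer.fit()
    print(trainer.export())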
Examples
^^^^^^^^
:githublink:`Example code <examples/nas/oneshot/enas>`
.. code-block:: bash
# In case NNI code is not cloned. If the code is cloned already, ignore this line and enter code folder.
git clone https://github.com/Microsoft/nni.git
# search the best architecture
cd examples/nas/oneshot/enas
# search in macro search space
python3 search.py --search-for macro
# search in micro search space
python3 search.py --search-for micro
# view more options for search
python3 search.py -h
.. _fbnet-strategy:
FBNet
-----
.. note:: This one-shot NAS is still implemented under NNI NAS 1.0, and will `be migrated to Retiarii framework in near future <https://github.com/microsoft/nni/issues/3814>`__.
For the mobile application of facial landmark detection, based on the basic architecture of the PFLD model, we have applied FBNet (block-wise DNAS) to design a concise model with a good trade-off between latency and accuracy. References are listed below:
* `FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search <https://arxiv.org/abs/1812.03443>`__
* `PFLD: A Practical Facial Landmark Detector <https://arxiv.org/abs/1902.10859>`__
FBNet is a block-wise differentiable NAS method (block-wise DNAS), where the best candidate building block at each layer is chosen via Gumbel-Softmax random sampling and differentiable training. At each layer (or stage) to be searched, the diverse candidate blocks are placed side by side (similar in spirit to structural re-parameterization), which allows sufficient pre-training of the supernet. The pre-trained supernet is then sampled to finetune the subnet and achieve better performance.
.. image:: ../../img/fbnet.png
:width: 800
:align: center
PFLD is a lightweight facial landmark model for real-time applications. The PFLD architecture is first simplified for acceleration by using the stem block of PeleeNet, average pooling with depthwise convolution, and the eSE module.
To achieve a better trade-off between latency and accuracy, FBNet is then applied to the simplified PFLD to search for the best block at each specific layer. The search space is based on the FBNet space and is optimized for mobile deployment by using average pooling with depthwise convolution, the eSE module, etc.
Experiments
^^^^^^^^^^^
To verify the effectiveness of FBNet applied to PFLD, we choose an open-source dataset with 106 landmark points as the benchmark:
* `Grand Challenge of 106-Point Facial Landmark Localization <https://arxiv.org/abs/1905.03469>`__
The baseline model is denoted as MobileNet-V3 PFLD (`Reference baseline <https://github.com/Hsintao/pfld_106_face_landmarks>`__), and the searched model is denoted as Subnet. The experimental results are listed below, where the latency is tested on a Qualcomm 625 CPU (ARMv8):
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Size
- Latency
- Validation NME
* - MobileNet-V3 PFLD
- 1.01MB
- 10ms
- 6.22%
* - Subnet
- 693KB
- 1.60ms
- 5.58%
Example
^^^^^^^
`Example code <https://github.com/microsoft/nni/tree/master/examples/nas/oneshot/pfld>`__
Please run the following scripts in the example directory.
The Python dependencies used here are listed below:
.. code-block:: bash
numpy==1.18.5
opencv-python==4.5.1.48
torch==1.6.0
torchvision==0.7.0
onnx==1.8.1
onnx-simplifier==0.3.5
onnxruntime==1.7.0
To run the tutorial, follow the steps below:
1. **Data Preparation**: First, you should download the `106points dataset <https://drive.google.com/file/d/1I7QdnLxAlyG2Tq3L66QYzGhiBEoVfzKo/view?usp=sharing>`__ to the path ``./data/106points``. The dataset includes the train set and the test set:
.. code-block:: bash
./data/106points/train_data/imgs
./data/106points/train_data/list.txt
./data/106points/test_data/imgs
./data/106points/test_data/list.txt
2. **Search**: Based on the architecture of the simplified PFLD, the multi-stage search space and the hyper-parameters for searching should first be configured to construct the supernet. For example:
.. code-block:: python
from lib.builder import search_space
from lib.ops import PRIMITIVES
from lib.supernet import PFLDInference, AuxiliaryNet
from nni.algorithms.nas.pytorch.fbnet import LookUpTable, NASConfig
# configuration of hyper-parameters
# search_space defines the multi-stage search space
nas_config = NASConfig(
model_dir="./ckpt_save",
nas_lr=0.01,
mode="mul",
alpha=0.25,
beta=0.6,
search_space=search_space,
)
# lookup table to manage the information
lookup_table = LookUpTable(config=nas_config, primitives=PRIMITIVES)
# created supernet
pfld_backbone = PFLDInference(lookup_table)
After creating the supernet with the specified search space and hyper-parameters, we can run the command below to start searching and training the supernet:
.. code-block:: bash
python train.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points"
The validation accuracy will be shown during training, and the model with the best accuracy will be saved as ``./ckpt_save/supernet/checkpoint_best.pth``.
3. **Finetune**: After pre-training the supernet, we can run the command below to sample the subnet and conduct the finetuning:
.. code-block:: bash
python retrain.py --dev_id "0,1" --snapshot "./ckpt_save" --data_root "./data/106points" \
--supernet "./ckpt_save/supernet/checkpoint_best.pth"
The validation accuracy will be shown during training, and the model with the best accuracy will be saved as ``./ckpt_save/subnet/checkpoint_best.pth``.
4. **Export**: After finetuning the subnet, we can run the command below to export the ONNX model:
.. code-block:: bash
python export.py --supernet "./ckpt_save/supernet/checkpoint_best.pth" \
--resume "./ckpt_save/subnet/checkpoint_best.pth"
The ONNX model is saved as ``./output/subnet.onnx``, and it can be further converted for a mobile inference engine by using `MNN <https://github.com/alibaba/MNN>`__.
The checkpoints of the pre-trained supernet and subnet are provided below:
* `Supernet <https://drive.google.com/file/d/1TCuWKq8u4_BQ84BWbHSCZ45N3JGB9kFJ/view?usp=sharing>`__
* `Subnet <https://drive.google.com/file/d/160rkuwB7y7qlBZNM3W_T53cb6MQIYHIE/view?usp=sharing>`__
* `ONNX model <https://drive.google.com/file/d/1s-v-aOiMv0cqBspPVF3vSGujTbn_T_Uo/view?usp=sharing>`__
.. _spos-strategy:
SPOS
----
Proposed in `Single Path One-Shot Neural Architecture Search with Uniform Sampling <https://arxiv.org/abs/1904.00420>`__, SPOS is a one-shot NAS method that addresses the difficulty of training one-shot NAS models by constructing a simplified supernet trained with a uniform path sampling method, so that all underlying architectures (and their weights) get trained fully and equally. An evolutionary algorithm is then applied to efficiently search for the best-performing architectures without any fine-tuning.
The implementation on NNI is based on the `official repo <https://github.com/megvii-model/SinglePathOneShot>`__. We implement a trainer that trains the supernet and an evolution tuner that leverages the NNI framework to speed up the evolutionary search phase.
.. autoclass:: nni.retiarii.oneshot.pytorch.SinglePathTrainer
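A minimal usage sketch of ``SinglePathTrainer`` is given below; unlike the trainers above, it takes separate training and validation datasets, because uniformly sampled paths are also evaluated on validation data during training. The toy model and hyper-parameters are illustrative assumptions; the :githublink:`example code <examples/nas/oneshot/spos>` is the authoritative reference.

.. code-block:: python

    import torch
    import torch.nn as nn
    from torchvision import datasets, transforms
    from nni.retiarii.nn.pytorch import LayerChoice
    from nni.retiarii.oneshot.pytorch import SinglePathTrainer

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = LayerChoice([nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(3, 16, 5, padding=2)])
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(16, 10)

        def forward(self, x):
            return self.fc(self.pool(self.conv(x)).flatten(1))

    transform = transforms.ToTensor()
    model = Net()
    trainer = SinglePathTrainer(
        model,
        loss=nn.CrossEntropyLoss(),
        metrics=lambda output, target: {'acc': (output.argmax(1) == target).float().mean().item()},
        optimizer=torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9),
        num_epochs=100,
        dataset_train=datasets.CIFAR10('./data', train=True, download=True, transform=transform),
        dataset_valid=datasets.CIFAR10('./data', train=False, download=True, transform=transform),
        batch_size=64,
    )
    trainer.fit()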
Examples
^^^^^^^^
Here is a use case with the search space from the paper. However, we apply a latency limit instead of a FLOPs limit during the architecture search phase.
:githublink:`Example code <examples/nas/oneshot/spos>`
**Requirements:** Prepare ImageNet in the standard format (follow the script `here <https://gist.github.com/BIGBALLON/8a71d225eff18d88e469e6ea9b39cef4>`__). Linking it to ``data/imagenet`` will be more convenient. Download the checkpoint file from `here <https://1drv.ms/u/s!Am_mmG2-KsrnajesvSdfsq_cN48?e=aHVppN>`__ (maintained by `Megvii <https://github.com/megvii-model>`__) if you don't want to retrain the supernet. Put ``checkpoint-150000.pth.tar`` under the ``data`` directory. After preparation, the expected code structure is as follows:
.. code-block:: bash
spos
├── architecture_final.json
├── blocks.py
├── data
│ ├── imagenet
│ │ ├── train
│ │ └── val
│ └── checkpoint-150000.pth.tar
├── network.py
├── readme.md
├── supernet.py
├── evaluation.py
├── search.py
└── utils.py
Then follow the 3 steps:
1. **Train Supernet**:
.. code-block:: bash
python supernet.py
This will export the checkpoint to the ``checkpoints`` directory for the next step.
.. note:: The data loading used in the official repo is `slightly different from usual <https://github.com/megvii-model/SinglePathOneShot/issues/5>`__, as they use BGR tensors and intentionally keep the values between 0 and 255 to align with their own DL framework. The option ``--spos-preprocessing`` simulates the original behavior and enables you to use the pretrained checkpoints.
2. **Evolution Search**: Single Path One-Shot leverages an evolutionary algorithm to search for the best architecture. In the paper, the search module, which is responsible for testing the sampled architecture, recalculates all the batch norm statistics on a subset of training images and evaluates the architecture on the full validation set.
In this example, it will inherit the ``state_dict`` of the supernet from ``./data/checkpoint-150000.pth.tar`` and search for the best architecture with the regularized evolution strategy. Search in the supernet with the following command:
.. code-block:: bash
python search.py
NNI supports a latency filter to filter out unsatisfactory models during the search phase. Latency is predicted by `Microsoft nn-Meter <https://github.com/microsoft/nn-Meter>`__. To apply the latency filter, users can run ``search.py`` with the additional argument ``--latency-filter``. Here is an example:
.. code-block:: bash
python search.py --latency-filter cortexA76cpu_tflite21
Note that the latency filter is only supported for the base execution engine.
The final architecture exported from every epoch of evolution can be found in ``trials`` under the working directory of your tuner, which, by default, is ``$HOME/nni-experiments/your_experiment_id/trials``.
3. **Train for Evaluation**:
.. code-block:: bash
python evaluation.py
By default, it will use ``architecture_final.json``. This architecture is provided by the official repo (converted into NNI format). You can use any architecture (e.g., the architecture found in step 2) with the ``--fixed-arc`` option.
Known Limitations
^^^^^^^^^^^^^^^^^
* Block search only. Channel search is not supported yet.
Current Reproduction Results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Reproduction is still in progress. Due to the gap between the official release and the original paper, we compare our current results with the official repo (our own run) and with the paper.
* The evolution phase is almost aligned with the official repo. Our evolution algorithm shows a converging trend and reaches ~65% accuracy at the end of the search. Nevertheless, this result is not on par with the paper. For details, please refer to `this issue <https://github.com/megvii-model/SinglePathOneShot/issues/6>`__.
* The retrain phase is not aligned. Our retraining code, which uses the architecture released by the authors, reaches 72.14% accuracy, still leaving a gap to the 73.61% of the official release and the 74.3% reported in the original paper.
.. _proxylessnas-strategy:
ProxylessNAS
------------
The paper `ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware <https://arxiv.org/abs/1812.00332>`__ removes the proxy: it directly learns architectures for large-scale target tasks and target hardware platforms. It addresses the high memory consumption issue of differentiable NAS and reduces the computational cost to the same level as regular training while still allowing a large candidate set. Please refer to the paper for the details.
.. autoclass:: nni.retiarii.oneshot.pytorch.ProxylessTrainer
To use the ProxylessNAS training/searching approach, users need to specify the search space in their model using the :doc:`NNI NAS interface </nas/construct_space>`, e.g., ``LayerChoice`` and ``InputChoice``. After defining and instantiating the model, the remaining work can be left to ``ProxylessTrainer`` by instantiating the trainer and passing the model to it.
.. code-block:: python
trainer = ProxylessTrainer(model,
loss=LabelSmoothingLoss(),
dataset=None,
optimizer=optimizer,
metrics=lambda output, target: accuracy(output, target, topk=(1, 5,)),
num_epochs=120,
log_frequency=10,
grad_reg_loss_type=args.grad_reg_loss_type,
grad_reg_loss_params=grad_reg_loss_params,
applied_hardware=args.applied_hardware, dummy_input=(1, 3, 224, 224),
ref_latency=args.reference_latency)
trainer.train()
trainer.export(args.arch_path)
The complete example code can be found :githublink:`here <examples/nas/oneshot/proxylessnas>`.
Implementation
^^^^^^^^^^^^^^
The implementation on NNI is based on the `official implementation <https://github.com/mit-han-lab/ProxylessNAS>`__. The official implementation supports two training approaches: gradient descent and RL based. Our current implementation on NNI supports the gradient descent training approach. The complete support of ProxylessNAS is ongoing.
The official implementation supports different target hardware, including 'mobile', 'cpu', 'gpu8', and 'flops'. In the NNI repo, hardware latency prediction is supported by `Microsoft nn-Meter <https://github.com/microsoft/nn-Meter>`__. nn-Meter is an accurate inference latency predictor for DNN models on diverse edge devices. nn-Meter currently supports four hardware configurations: ``cortexA76cpu_tflite21``, ``adreno640gpu_tflite21``, ``adreno630gpu_tflite21``, and ``myriadvpu_openvino2019r2``. Users can find more information about nn-Meter on its website. More hardware will be supported in the future. Users can find more details about applying ``nn-Meter`` :doc:`here </nas/hardware_aware_nas>`.
Below we describe the implementation details. Like other one-shot NAS algorithms on NNI, ProxylessNAS is composed of two parts: *search space* and *training approach*. To flexibly define your own search space and use the built-in ProxylessNAS training approach, please refer to the :githublink:`example code <examples/nas/oneshot/proxylessnas>`.
.. image:: ../../img/proxylessnas.png
:width: 450
:align: center
The ProxylessNAS training approach is composed of ProxylessLayerChoice and ProxylessTrainer. ProxylessLayerChoice instantiates a MixedOp for each mutable (i.e., LayerChoice) and manages the architecture weights in the MixedOp. **For DataParallel**, architecture weights should be included in the user model. Specifically, in the ProxylessNAS implementation, we add the MixedOp to the corresponding mutable (i.e., LayerChoice) as a member variable. The ProxylessLayerChoice class also exposes two member functions, i.e., ``resample`` and ``finalize_grad``, for the trainer to control the training of architecture weights.
Reproduction Results
^^^^^^^^^^^^^^^^^^^^
To reproduce the results, we first ran the search. We found that although it runs for many epochs, the chosen architecture converges within the first several epochs. This is probably caused by the hyper-parameters or the implementation; we are working on it.
Customization
-------------
.. autoclass:: nni.retiarii.oneshot.BaseOneShotTrainer
:members:
.. autofunction:: nni.retiarii.oneshot.pytorch.utils.replace_layer_choice
.. autofunction:: nni.retiarii.oneshot.pytorch.utils.replace_input_choice
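To customize a one-shot training approach, one typically subclasses ``BaseOneShotTrainer`` and uses the helper functions above to swap every ``LayerChoice`` / ``InputChoice`` in the user model for a trainable mixed operation. The sketch below is only a schematic outline under the assumption that ``replace_layer_choice`` returns ``(name, module)`` pairs and that ``LayerChoice`` exposes ``names`` and item access, as in recent NNI versions; it is not a complete algorithm.

.. code-block:: python

    import random
    import torch.nn as nn
    from nni.retiarii.oneshot import BaseOneShotTrainer
    from nni.retiarii.oneshot.pytorch.utils import replace_layer_choice

    class RandomPathChoice(nn.Module):
        """Replaces a LayerChoice and forwards through one randomly picked candidate."""
        def __init__(self, layer_choice):
            super().__init__()
            # Assumes LayerChoice exposes candidate names and item access.
            self.op_choices = nn.ModuleDict({name: layer_choice[name] for name in layer_choice.names})

        def forward(self, x):
            name = random.choice(list(self.op_choices.keys()))
            return self.op_choices[name](x)

    class RandomOneShotTrainer(BaseOneShotTrainer):
        def __init__(self, model):
            self.model = model
            # Swap every LayerChoice in the model for our custom module and
            # keep the (name, module) pairs for later export.
            self.mutables = replace_layer_choice(self.model, lambda lc: RandomPathChoice(lc))

        def fit(self):
            ...  # the weight (and, if needed, architecture) training loop goes here

        def export(self):
            # Export one candidate name per mutable, e.g. the first one.
            return {name: next(iter(module.op_choices)) for name, module in self.mutables}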
Examples
========
More examples can be found in our :githublink:`GitHub repository <examples>`.
.. cardlinkitem::
:header: HPO Quickstart with PyTorch
:description: Use HPO to tune a PyTorch FashionMNIST model
:link: tutorials/hpo_quickstart_pytorch/main
:image: ../img/thumbnails/hpo-pytorch.svg
:background: purple
:tags: HPO
.. cardlinkitem::
:header: HPO Quickstart with TensorFlow
:description: Use HPO to tune a TensorFlow MNIST model
:link: tutorials/hpo_quickstart_tensorflow/main
:image: ../img/thumbnails/hpo-tensorflow.svg
:background: purple
:tags: HPO
.. cardlinkitem::
:header: HPO using command line tool
:description: Run HPO experiment with nnictl
:link: tutorials/hpo_nnictl/nnictl
:image: ../img/thumbnails/hpo-pytorch.svg
:background: purple
:tags: HPO
.. cardlinkitem::
:header: Hello, NAS!
:description: Beginners' NAS tutorial on how to search for neural architectures for the MNIST dataset.
:link: tutorials/hello_nas
:image: ../img/thumbnails/nas-tutorial.svg
:background: cyan
:tags: NAS
.. cardlinkitem::
:header: Use NAS Benchmarks as Datasets
:description: Query data of popular NAS benchmarks from our preprocessed benchmark database.
:link: tutorials/nasbench_as_dataset
:image: ../img/thumbnails/nas-benchmark.svg
:background: cyan
:tags: NAS
.. cardlinkitem::
:header: Get Started with Model Pruning on MNIST
:description: Familiarize yourself with pruning to compress your model
:link: tutorials/pruning_quick_start_mnist
:image: ../img/thumbnails/pruning-tutorial.svg
:background: blue
:tags: Compression
.. cardlinkitem::
:header: Get Started with Model Quantization on MNIST
:description: Familiarize yourself with quantization to compress your model
:link: tutorials/quantization_quick_start_mnist
:image: ../img/thumbnails/quantization-tutorial.svg
:background: indigo
:tags: Compression
.. cardlinkitem::
:header: Speedup Model with Mask
:description: Make your model really smaller and faster with speedup after it has been pruned by a pruner
:link: tutorials/pruning_speedup
:image: ../img/thumbnails/pruning-speed-up.svg
:background: blue
:tags: Compression
.. cardlinkitem::
:header: Speedup Model with Calibration Config
:description: Make your model really smaller and faster with speedup after it has been quantized by a quantizer
:link: tutorials/quantization_speedup
:image: ../img/thumbnails/quantization-speed-up.svg
:background: indigo
:tags: Compression
.. cardlinkitem::
:header: Pruning Bert on Task MNLI
:description: An end-to-end example of how to use NNI to prune a transformer and show the real speedup numbers
:link: tutorials/pruning_bert_glue
:image: ../img/thumbnails/pruning-tutorial.svg
:background: indigo
:tags: Compression
Experiment Management
=====================
An experiment can be created with the command line tool ``nnictl`` or with Python APIs. NNI provides both the command line tool ``nnictl`` and the web portal to manage experiments, including creating, stopping, resuming, deleting, ranking, and comparing them.
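As a quick illustration of the Python API, the hedged sketch below launches a local HPO experiment; ``train.py`` is a placeholder for your own trial script, and the search space and tuner settings are illustrative assumptions.

.. code-block:: python

    from nni.experiment import Experiment

    search_space = {
        'lr': {'_type': 'loguniform', '_value': [1e-5, 1e-1]},
        'momentum': {'_type': 'uniform', '_value': [0.5, 0.99]},
    }

    experiment = Experiment('local')
    experiment.config.trial_command = 'python train.py'   # your own trial script
    experiment.config.trial_code_directory = '.'
    experiment.config.search_space = search_space
    experiment.config.tuner.name = 'TPE'
    experiment.config.tuner.class_args = {'optimize_mode': 'maximize'}
    experiment.config.max_trial_number = 20
    experiment.config.trial_concurrency = 2

    experiment.run(8080)  # blocks until finished; web portal at http://localhost:8080
    experiment.stop()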
Management with ``nnictl``
--------------------------
The experiment management ability of ``nnictl`` is almost equivalent to that of the :doc:`web_portal/web_portal`. Users can refer to :doc:`../reference/nnictl` for detailed usage. It is highly recommended when visualization is not well supported in your environment (e.g., when a web browser is not available).
Management with web portal
--------------------------
Experiment management on the web portal gives a quick overview of all the experiments on the user's machine. Users can easily switch to one experiment from this page. Users can refer to the :ref:`exp-manage-webportal` page for details. Experiment management on the web portal is still under intensive development to bring more user-friendly features.
Overview of NNI Experiment
==========================
An NNI experiment is a unit of one tuning process. For example, it can be one run of hyper-parameter tuning on a specific search space, one run of neural architecture search on a search space, or one run of automatic model compression towards a user-specified goal on latency and accuracy. Usually, the tuning process requires many trials to explore feasible and potentially well-performing models. Thus, an important component of an NNI experiment is the **training service**, which is a unified interface that abstracts diverse computation resources (e.g., local machine, remote servers, AKS). Users can easily run the tuning process on their preferred computation resource and platform. On the other hand, an NNI experiment provides a **WebUI** to visualize the tuning process for users.
While developing a DNN model, users need to manage the tuning process, such as creating an experiment, adjusting an experiment, killing or rerunning a trial in an experiment, or dumping experiment data for customized analysis. Also, users may create a new experiment for comparison, or run experiments concurrently for new model development tasks. Thus, NNI provides **experiment management** functionality. Users can use :doc:`../reference/nnictl` to interact with experiments.
The relation of the components in an NNI experiment is illustrated in the following figure. Hyper-parameter optimization (HPO), neural architecture search (NAS), and model compression are three key features in NNI that help users develop and tune their models. The training service provides the ability to run trials in parallel on available computation resources. The WebUI visualizes the tuning process. *nnictl* is for managing the experiments.
.. image:: ../../img/experiment_arch.png
:scale: 80 %
:align: center
Before reading the following content, we recommend going through either :doc:`the quickstart of HPO </tutorials/hpo_quickstart_pytorch/main>` or :doc:`the quickstart of NAS </tutorials/hello_nas>` first.
* :doc:`Overview of NNI training service <training_service/overview>`
* :doc:`Introduction to Web Portal <web_portal/web_portal>`
* :doc:`Manage Multiple Experiments <experiment_management>`
Experiment
==========
.. toctree::
:maxdepth: 2
Overview <overview>
Training Service <training_service/toctree>
Web Portal <web_portal/toctree>
Experiment Management <experiment_management>