Currently, we have several filter pruning algorithm for the convolutional layers: FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, Taylor FO On Weight Pruner. In these filter pruning algorithms, the pruner will prune each convolutional layer separately. While pruning a convolution layer, the algorithm will quantify the importance of each filter based on some specific rules(such as l1-norm), and prune the less important filters.
As `dependency analysis utils <./CompressionUtils.rst>`__ shows, if the output channels of two convolutional layers(conv1, conv2) are added together, then these two conv layers have channel dependency with each other(more details please see `Compression Utils <./CompressionUtils.rst>`__\ ). Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
:target: ../../img/mask_conflict.jpg
:alt:
If we prune the first 50% of output channels(filters) for conv1, and prune the last 50% of output channels for conv2. Although both layers have pruned 50% of the filters, the speedup module still needs to add zeros to align the output channels. In this case, we cannot harvest the speed benefit from the model pruning.
To better gain the speed benefit of the model pruning, we add a dependency-aware mode for the Filter Pruner. In the dependency-aware mode, the pruner prunes the model not only based on the l1 norm of each filter, but also the topology of the whole network architecture.
In the dependency-aware mode(\ ``dependency_aware`` is set ``True``\ ), the pruner will try to prune the same output channels for the layers that have the channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
:target: ../../img/dependency-aware.jpg
:alt:
Take the dependency-aware mode of L1Filter Pruner as an example. Specifically, the pruner will calculate the L1 norm (for example) sum of all the layers in the dependency set for each channel. Obviously, the number of channels that can actually be pruned of this dependency set in the end is determined by the minimum sparsity of layers in this dependency set(denoted by ``min_sparsity``\ ). According to the L1 norm sum of each channel, the pruner will prune the same ``min_sparsity`` channels for all the layers. Next, the pruner will additionally prune ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on its own L1 norm of each channel. For example, suppose the output channels of ``conv1`` , ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3, 0.2 respectively. In this case, the ``dependency-aware pruner`` will
.. code-block:: bash
- First, prune the same 20% of channels for `conv1` and `conv2` according to L1 norm sum of `conv1` and `conv2`.
- Second, the pruner will additionally prune 10% channels for `conv1` according to the L1 norm of each channel of `conv1`.
In addition, for the convolutional layers that have more than one filter group, ``dependency-aware pruner`` will also try to prune the same number of the channels for each filter group. Overall, this pruner will prune the model according to the L1 norm of each filter and try to meet the topological constrains(channel dependency, etc) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
Usage
-----
In this section, we will show how to enable the dependency-aware mode for the filter pruner. Currently, only the one-shot pruners such as FPGM Pruner, L1Filter Pruner, L2Filter Pruner, Activation APoZ Rank Filter Pruner, Activation Mean Rank Filter Pruner, Taylor FO On Weight Pruner, support the dependency-aware mode.
To enable the dependency-aware mode for ``L1FilterPruner``\ :
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
In order to compare the performance of the pruner with or without the dependency-aware mode, we use L1FilterPruner to prune the Mobilenet_v2 separately when the dependency-aware mode is turned on and off. To simplify the experiment, we use the uniform pruning which means we allocate the same sparsity for all convolutional layers in the model.
We trained a Mobilenet_v2 model on the cifar10 dataset and prune the model based on this pretrained checkpoint. The following figure shows the accuracy and FLOPs of the model pruned by different pruners.
.. image:: ../../img/mobilev2_l1_cifar.jpg
:target: ../../img/mobilev2_l1_cifar.jpg
:alt:
In the figure, the ``Dependency-aware`` represents the L1FilterPruner with dependency-aware mode enabled. ``L1 Filter`` is the normal ``L1FilterPruner`` without the dependency-aware mode, and the ``No-Dependency`` means pruner only prunes the layers that has no channel dependency with other layers. As we can see in the figure, when the dependency-aware mode enabled, the pruner can bring higher accuracy under the same Flops.
Pruning algorithms usually use weight masks to simulate the real pruning. Masks can be used
to check model performance of a specific pruning (or sparsity), but there is no real speedup.
Since model speedup is the ultimate goal of model pruning, we try to provide a tool to users
to convert a model to a smaller one based on user provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. One is fine-grained pruning, it does not change the shape of weights, and input/output tensors. Sparse kernel is required to speed up a fine-grained pruned layer. The other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning. To speed up this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one. Since the support of sparse kernels in community is limited, we only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.
Design and Implementation
-------------------------
To speed up a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask, or replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors, thus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change. Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced; second, replace the modules. The first step requires topology (i.e., connections) of the model, we use ``jit.trace`` to obtain the model graph for PyTorch.
For each module, we should prepare four functions, three for shape inference and one for module replacement. The three shape inference functions are: given weight shape infer input/output shape, given input shape infer weight/output shape, given output shape infer weight/input shape. The module replacement function returns a newly created module which is smaller.
Usage
-----
.. code-block:: python
from nni.compression.pytorch import ModelSpeedup
# model: the model you want to speed up
# dummy_input: dummy input of the model, given to `jit.trace`
# masks_file: the mask file created by pruning algorithms
For complete examples please refer to :githublink:`the code <examples/model_compress/pruning/speedup/model_speedup.py>`
NOTE: The current implementation supports PyTorch 1.3.1 or newer.
Limitations
-----------
Since every module requires four functions for shape inference and module replacement, this is a large amount of work, we only implemented the ones that are required by the examples. If you want to speed up your own model which cannot supported by the current implementation, you are welcome to contribute.
For PyTorch we can only replace modules, if functions in ``forward`` should be replaced, our current implementation does not work. One workaround is make the function a PyTorch module.
Speedup Results of Examples
---------------------------
The code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.
slim pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01197
- 0.005107
* - 2
- 0.02019
- 0.008769
* - 4
- 0.02733
- 0.014809
* - 8
- 0.04310
- 0.027441
* - 16
- 0.07731
- 0.05008
* - 32
- 0.14464
- 0.10027
fpgm pruner example
^^^^^^^^^^^^^^^^^^^
on cpu,
input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
too large variance
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01383
- 0.01839
* - 2
- 0.01167
- 0.003558
* - 4
- 0.01636
- 0.01088
* - 40
- 0.14412
- 0.08268
* - 40
- 1.29385
- 0.14408
* - 40
- 0.41035
- 0.46162
* - 400
- 6.29020
- 5.82143
l1filter pruner example
^^^^^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01026
- 0.003677
* - 2
- 0.01657
- 0.008161
* - 4
- 0.02458
- 0.020018
* - 8
- 0.03498
- 0.025504
* - 16
- 0.06757
- 0.047523
* - 32
- 0.10487
- 0.086442
APoZ pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01389
- 0.004208
* - 2
- 0.01628
- 0.008310
* - 4
- 0.02521
- 0.014008
* - 8
- 0.03386
- 0.023923
* - 16
- 0.06042
- 0.046183
* - 32
- 0.12421
- 0.087113
SimulatedAnnealing pruner example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In this experiment, we use SimulatedAnnealing pruner to prune the resnet18 on the cifar10 dataset.
We measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.
The latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `参考论文 <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `参考论文 <https://arxiv.org/abs/1602.02830>`__
NNI 模型压缩提供了简洁的接口,用于自定义新的压缩算法。接口的设计理念是,将框架相关的实现细节包装起来,让用户能聚焦于压缩逻辑。用户可以进一步了解我们的压缩框架,并根据我们的框架定制新的压缩算法(剪枝算法或量化算法)。此外,还可利用 NNI 的自动调参功能来自动的压缩模型。参考 `这里 <./advanced.rst>`__ 了解更多细节。
The other way is more detailed. You can customize the dtype and scheme in each quantization config list like:
.. code-block:: python
config_list = [{
'quant_types': ['weight'],
'quant_bits': 8,
'op_types':['Conv2d', 'Linear'],
'quant_dtype': 'int',
'quant_scheme': 'per_channel_symmetric'
}, {
'quant_types': ['output'],
'quant_bits': 8,
'quant_start_step': 7000,
'op_types':['ReLU6'],
'quant_dtype': 'uint',
'quant_scheme': 'per_tensor_affine'
}]
Multi-GPU training
^^^^^^^^^^^^^^^^^^^
QAT quantizer natively supports multi-gpu training (DataParallel and DistributedDataParallel). Note that the quantizer
instantiation should happen before you wrap your model with DataParallel or DistributedDataParallel. For example:
.. code-block:: python
from torch.nn.parallel import DistributedDataParallel as DDP
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
model = define_your_model()
model = QAT_Quantizer(model, **other_params) # <--- QAT_Quantizer instantiation
model = DDP(model)
for i in range(epochs):
train(model)
eval(model)
----
LSQ Quantizer
-------------
In `LEARNED STEP SIZE QUANTIZATION <https://arxiv.org/pdf/1902.08153.pdf>`__\ , authors Steven K. Esser and Jeffrey L. McKinstry provide an algorithm to train the scales with gradients.
..
The authors introduce a novel means to estimate and scale the task loss gradient at each weight and activation layer’s quantizer step size, such that it can be learned in conjunction with other network parameters.
Usage
^^^^^
You can add codes below before your training codes. Three things must be done:
1. configure which layer to be quantized and which tensor (input/output/weight) of that layer to be quantized.
2. construct the lsq quantizer
3. call the `compress` API
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import LsqQuantizer
You can view example for more information. :githublink:`examples/model_compress/quantization/LSQ_torch_quantizer.py <examples/model_compress/quantization/LSQ_torch_quantizer.py>`
User configuration for LSQ Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
common configuration needed by compression algorithms can be found at `Specification of `config_list <./QuickStart.rst>`__.
configuration needed by this algorithm :
----
DoReFa Quantizer
----------------
In `DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients <https://arxiv.org/abs/1606.06160>`__\ , authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training.
Usage
^^^^^
To implement DoReFa Quantizer, you can add code below before your training code
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
config_list = [{
'quant_types': ['weight'],
'quant_bits': 8,
'op_types': ['default']
}]
quantizer = DoReFaQuantizer(model, config_list)
quantizer.compress()
You can view example for more information
User configuration for DoReFa Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
common configuration needed by compression algorithms can be found at `Specification of ``config_list`` <./QuickStart.rst>`__.
configuration needed by this algorithm :
----
BNN Quantizer
-------------
In `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ ,
..
We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import BNNQuantizer
You can view example :githublink:`examples/model_compress/quantization/BNN_quantizer_cifar10.py <examples/model_compress/quantization/BNN_quantizer_cifar10.py>` for more information.
User configuration for BNN Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
common configuration needed by compression algorithms can be found at `Specification of ``config_list`` <./QuickStart.rst>`__.
configuration needed by this algorithm :
Experiment
^^^^^^^^^^
We implemented one of the experiments in `Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ , we quantized the **VGGNet** for CIFAR-10 in the paper. Our experiments results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Accuracy
* - VGGNet
- 86.93%
The experiments code can be found at :githublink:`examples/model_compress/quantization/BNN_quantizer_cifar10.py <examples/model_compress/quantization/BNN_quantizer_cifar10.py>`
Observer Quantizer
------------------
..
Observer quantizer is a framework of post-training quantization. It will insert observers into the place where the quantization will happen. During quantization calibration, each observer will record all the tensors it 'sees'. These tensors will be used to calculate the quantization statistics after calibration.
Usage
^^^^^
1. configure which layer to be quantized and which tensor (input/output/weight) of that layer to be quantized.
2. construct the observer quantizer.
3. do quantization calibration.
4. call the `compress` API to calculate the scale and zero point for each tensor and switch model to evaluation mode.
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import ObserverQuantizer
You can view example :githublink:`examples/model_compress/quantization/observer_quantizer.py <examples/model_compress/quantization/observer_quantizer.py>` for more information.
User configuration for Observer Quantizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Common configuration needed by compression algorithms can be found at `Specification of `config_list <./QuickStart.rst>`__.
.. note::
This quantizer is still under development for now. Some quantizer settings are hard-coded:
Thespecificationofconfigurationcanbefound`here<./Tutorial.rst#specify-the-configuration>`__.Notethatdifferentprunersmayhavetheirowndefinedfieldsinconfiguration.Pleaserefertoeachpruner's `usage <./Pruner.rst>`__ for details, and adjust the configuration accordingly.
Step2. Choose a pruner and compress the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
First instantiate the chosen pruner with your model and configuration as arguments, then invoke ``compress()`` to compress your model. Note that, some algorithms may check gradients for compressing, so we may also define a trainer, an optimizer, a criterion and pass them to the pruner.
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
pruner = LevelPruner(model, config_list)
model = pruner.compress()
Some pruners (e.g., L1FilterPruner, FPGMPruner) prune once, some pruners (e.g., AGPPruner) prune your model iteratively, the masks are adjusted epoch by epoch during training.
So if the pruners prune your model iteratively or they need training or inference to get gradients, you need pass finetuning logic to pruner.
For example:
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import AGPPruner
Plese refer to :githublink:`mnist example <examples/model_compress/pruning/naive_prune_torch.py>` for example code.
More examples of pruning algorithms can be found in :githublink:`basic_pruners_torch <examples/model_compress/pruning/basic_pruners_torch.py>` and :githublink:`auto_pruners_torch <examples/model_compress/pruning/auto_pruners_torch.py>`.
Model Quantization
------------------
Here we use `QAT Quantizer <../Compression/Quantizer.rst#qat-quantizer>`__ as an example to show the usage of pruning in NNI.
Step1. Write configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python
config_list = [{
'quant_types': ['weight', 'input'],
'quant_bits': {
'weight': 8,
'input': 8,
}, # you can just use `int` here because all `quan_types` share same bits length, see config for `ReLu6` below.
'op_types':['Conv2d', 'Linear'],
'quant_dtype': 'int',
'quant_scheme': 'per_channel_symmetric'
}, {
'quant_types': ['output'],
'quant_bits': 8,
'quant_start_step': 7000,
'op_types':['ReLU6'],
'quant_dtype': 'uint',
'quant_scheme': 'per_tensor_affine'
}]
The specification of configuration can be found `here <./Tutorial.rst#quantization-specific-keys>`__.
Step2. Choose a quantizer and compress the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
quantizer = QAT_Quantizer(model, config_list)
quantizer.compress()
Step3. Export compression result
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After training and calibration, you can export model weight to a file, and the generated calibration parameters to a file as well. Exporting onnx model is also supported.
In this tutorial, we will explain more detailed usage about the model compression in NNI.
Setup compression goal
----------------------
Specify the configuration
^^^^^^^^^^^^^^^^^^^^^^^^^
Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only a certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a python ``list`` object, where each element is a ``dict`` object.
The ``dict``\ s in the ``list`` are applied one by one, that is, the configurations in latter ``dict`` will overwrite the configurations in former ones on the operations that are within the scope of both of them.
There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:
* **op_types**\ : This is to specify what types of operations to be compressed. 'default' means following the algorithm's default setting. All suported module types are defined in :githublink:`default_layers.py <nni/compression/pytorch/default_layers.py>` for pytorch.
* **op_names**\ : This is to specify by name what operations to be compressed. If this field is omitted, operations will not be filtered by it.
* **exclude**\ : Default is False. If this field is True, it means the operations with specified types and names will be excluded from the compression.
Some other keys are often specific to a certain algorithm, users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.
To prune all ``Conv2d`` layers with the sparsity of 0.6, the configuration can be written as:
.. code-block:: python
[{
'sparsity': 0.6,
'op_types': ['Conv2d']
}]
To control the sparsity of specific layers, the configuration can be written as:
.. code-block:: python
[{
'sparsity': 0.8,
'op_types': ['default']
},
{
'sparsity': 0.6,
'op_names': ['op_name1', 'op_name2']
},
{
'exclude': True,
'op_names': ['op_name3']
}]
It means following the algorithm's default setting for compressed operations with sparsity 0.8, but for ``op_name1`` and ``op_name2`` use sparsity 0.6, and do not compress ``op_name3``.
Quantization specific keys
^^^^^^^^^^^^^^^^^^^^^^^^^^
Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.
* **quant_types** : list of string.
Type of quantization you want to apply, currently support 'weight', 'input', 'output'. 'weight' means applying quantization operation
to the weight parameter of modules. 'input' means applying quantization operation to the input of module forward method. 'output' means applying quantization operation to the output of module forward method, which is often called as 'activation' in some papers.
* **quant_bits** : int or dict of {str : int}
bits length of quantization, key is the quantization type, value is the quantization bits length, eg.
.. code-block:: python
{
quant_bits: {
'weight': 8,
'output': 4,
},
}
when the value is int type, all quantization types share same bits length. eg.
.. code-block:: python
{
quant_bits: 8, # weight or output quantization are all 8 bits
}
* **quant_dtype** : str or dict of {str : str}
quantization dtype, used to determine the range of quantized value. Two choices can be used:
- int: the range is singed
- uint: the range is unsigned
Two ways to set it. One is that the key is the quantization type, and the value is the quantization dtype, eg.
.. code-block:: python
{
quant_dtype: {
'weight': 'int',
'output': 'uint,
},
}
The other is that the value is str type, and all quantization types share the same dtype. eg.
.. code-block:: python
{
'quant_dtype': 'int', # the dtype of weight and output quantization are all 'int'
}
There are totally two kinds of `quant_dtype` you can set, they are 'int' and 'uint'.
* **quant_scheme** : str or dict of {str : str}
quantization scheme, used to determine the quantization manners. Four choices can used:
- per_tensor_affine: per tensor, asymmetric quantization
- per_tensor_symmetric: per tensor, symmetric quantization
- per_channel_affine: per channel, asymmetric quantization
- per_channel_symmetric: per channel, symmetric quantization
Two ways to set it. One is that the key is the quantization type, value is the quantization scheme, eg.
.. code-block:: python
{
quant_scheme: {
'weight': 'per_channel_symmetric',
'output': 'per_tensor_affine',
},
}
The other is that the value is str type, all quantization types share the same quant_scheme. eg.
.. code-block:: python
{
quant_scheme: 'per_channel_symmetric', # the quant_scheme of weight and output quantization are all 'per_channel_symmetric'
}
There are totally four kinds of `quant_scheme` you can set, they are 'per_tensor_affine', 'per_tensor_symmetric', 'per_channel_affine' and 'per_channel_symmetric'.
The following example shows a more complete ``config_list``\ , it uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.
.. code-block:: python
config_list = [{
'quant_types': ['weight'],
'quant_bits': 8,
'op_names': ['conv1'],
'quant_dtype': 'int',
'quant_scheme': 'per_channel_symmetric'
},
{
'quant_types': ['weight'],
'quant_bits': 4,
'quant_start_step': 0,
'op_names': ['conv2'],
'quant_dtype': 'int',
'quant_scheme': 'per_tensor_symmetric'
},
{
'quant_types': ['weight'],
'quant_bits': 3,
'op_names': ['fc1'],
'quant_dtype': 'int',
'quant_scheme': 'per_tensor_symmetric'
},
{
'quant_types': ['weight'],
'quant_bits': 2,
'op_names': ['fc2'],
'quant_dtype': 'int',
'quant_scheme': 'per_channel_symmetric'
}]
In this example, 'op_names' is the name of layer and four layers will be quantized to different quant_bits.
Export compression result
-------------------------
Export the pruned model
^^^^^^^^^^^^^^^^^^^^^^^
You can easily export the pruned model using the following API if you are pruning your model, ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. Note that, the exported ``model.pth``\ has the same parameters as the original model except the masked weights are zero. ``mask_dict`` stores the binary value that produced by the pruning algorithm, which can be further used to speed up the model.
You can export the quantized model directly by using ``torch.save`` api and the quantized model can be loaded by ``torch.load`` without any extra modification. The following example shows the normal procedure of saving, loading quantized model and get related parameters in QAT.
.. code-block:: python
# Save quantized model which is generated by using NNI QAT algorithm
# Get scale, zero_point and weight of conv1 in loaded model
conv1 = qmodel_load.conv1
scale = conv1.module.scale
zero_point = conv1.module.zero_point
weight = conv1.module.weight
Speed up the model
------------------
Masks do not provide real speedup of your model. The model should be speeded up based on the exported masks, thus, we provide an API to speed up your model as shown below. After invoking ``apply_compression_results`` on your model, your model becomes a smaller one with shorter inference latency.
.. code-block:: python
from nni.compression.pytorch import apply_compression_results, ModelSpeedup
Please refer to `here <ModelSpeedup.rst>`__ for detailed description. The example code for model speedup can be found :githublink:`here <examples/model_compress/pruning/model_speedup.py>`
Control the Fine-tuning process
-------------------------------
Enhance the fine-tuning process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Knowledge distillation effectively learns a small student model from a large teacher model. Users can enhance the fine-tuning process that utilize knowledge distillation to improve the performance of the compressed model. Example code can be found :githublink:`here <examples/model_compress/pruning/finetune_kd_torch.py>`