.. toctree::
:hidden:
:maxdepth: 2
Pruning <pruning>
Quantization <quantization>
Config Specification <compression_config_list>
Advanced Usage <advanced_usage>
.. attention::

   NNI's model pruning framework has been upgraded to a more powerful version (named pruning v2 before nni v2.6).
   The old version (`named pruning before nni v2.6 <https://nni.readthedocs.io/en/v2.6/Compression/pruning.html>`_) will be out of maintenance.
   If for some reason you have to use the old pruning, note that v2.6 is the last nni version that supports it.
.. Using rubric to prevent the section heading from being included into the toc
.. rubric:: Overview
Deep neural networks (DNNs) have achieved great success in many tasks like computer vision, natural language processing, and speech processing.
However, typical neural networks are both computationally expensive and energy-intensive,
which makes them difficult to deploy on devices with low computation resources or with strict latency requirements.
Therefore, a natural thought is to perform model compression to reduce model size and accelerate model training/inference without losing performance significantly.
Model compression techniques can be divided into two categories: pruning and quantization.
Pruning methods explore the redundancy in the model weights and try to remove/prune the redundant and uncritical weights.
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations.
We further elaborate on the two methods, pruning and quantization, in the following chapters. Besides, the figure below visualizes the difference between these two methods.
.. image:: ../../img/prune_quant.jpg
   :target: ../../img/prune_quant.jpg
   :scale: 40%
   :alt:
NNI provides an easy-to-use toolkit to help users design and use model pruning and quantization algorithms.
To compress their models, users only need to add several lines to their code.
There are some popular model compression algorithms built into NNI.
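For example, using a built-in pruner can be as simple as the following sketch (assuming ``model`` is an existing, pre-trained ``torch.nn.Module``; the pruners and their configuration formats are described in the following chapters):

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner

   # prune 50% of the weights of all supported layer types
   config_list = [{'sparsity': 0.5, 'op_types': ['default']}]
   pruner = LevelPruner(model, config_list)
   # the returned model is wrapped with masks and can be fine-tuned directly
   model = pruner.compress()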
There are several core features supported by NNI model compression:

* Concise interface for users to customize their own compression algorithms.
.. rubric:: Compression Pipeline
.. image:: ../../img/compression_flow.jpg
   :target: ../../img/compression_flow.jpg
If users want to apply both, a sequential mode is recommended as common practice.

The interface and APIs are unified for both PyTorch and TensorFlow. Currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.
The supported model compression algorithms include pruning algorithms and quantization algorithms.

.. rubric:: Supported Pruning Algorithms
Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and mitigate the over-fitting issue.
- Movement Pruning: Adaptive Sparsity by Fine-Tuning `Reference Paper <https://arxiv.org/abs/2005.07683>`__
.. rubric:: Supported Quantization Algorithms

Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce the computation cost and the inference time.
- Post training quantization. Collect quantization information during calibration with observers.
.. rubric:: Model Speedup
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
For example, pruning algorithms use masks to simulate sparsity, and quantization algorithms still store quantized values in float32.
Given the output masks and quantization bits produced by those algorithms, NNI can really speed up the model.
The following figure shows how NNI prunes and speeds up your models.
.. image:: ../../img/pipeline_compress.jpg
   :target: ../../img/pipeline_compress.jpg
   :scale: 40%
   :alt:
The detailed tutorial of Speed Up Model with Mask can be found :doc:`here <../tutorials/pruning_speed_up>`.
The detailed tutorial of Speed Up Model with Calibration Config can be found :doc:`here <../tutorials/quantization_speed_up>`.
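As a rough sketch of what mask-based speedup looks like in code (assuming ``model`` has already been pruned by a pruner and ``mask.pth`` is the mask file exported by ``pruner.export_model``; see the tutorials above for the exact usage):

.. code-block:: python

   import torch
   from nni.compression.pytorch import ModelSpeedup

   # export the masks produced during pruning
   pruner.export_model(model_path='pruned_model.pth', mask_path='mask.pth')

   # dummy_input is used to trace the model and propagate the masks
   dummy_input = torch.rand(1, 3, 224, 224)
   ModelSpeedup(model, dummy_input, masks_file='mask.pth').speedup_model()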
If you want to compress your model, but don't know what compression algorithm to choose, or don't know what sparsity is suitable for your model, or just want to try more possibilities, auto compression may help you.
Users can choose different compression algorithms and define the algorithms' search space, then auto compression will launch an NNI experiment and try different compression algorithms with varying sparsity automatically.
Of course, in addition to the sparsity rate, users can also introduce other related parameters into the search space.
If you don't know what a search space is or how to write one, `this <./Tutorial/SearchSpaceSpec.rst>`__ is for your reference.
Using auto compression is similar to launching an NNI experiment from Python.
The main differences are as follows:

* Use a generator to help generate the search space object.
* The model to be compressed needs to be provided, and it should already be pre-trained.
* No need to set ``trial_command``; in addition, ``auto_compress_module`` needs to be set as the ``AutoCompressionExperiment`` input.
.. note::
   Auto compression only supports TPE Tuner, Random Search Tuner, Anneal Tuner, and Evolution Tuner right now.
Generate search space
---------------------
Due to the extensive use of nested search spaces, we recommend using a generator to configure the search space.
The following is an example: use ``add_config()`` to add a sub-config, then ``dumps()`` to dump the search space dict.
.. code-block:: python
from nni.algorithms.compression.pytorch.auto_compress import AutoCompressionSearchSpaceGenerator
generator = AutoCompressionSearchSpaceGenerator()
generator.add_config('level', [
{
"sparsity": {
"_type": "uniform",
"_value": [0.01, 0.99]
},
'op_types': ['default']
}
])
generator.add_config('qat', [
{
'quant_types': ['weight', 'output'],
'quant_bits': {
'weight': 8,
'output': 8
},
'op_types': ['Conv2d', 'Linear']
}])
search_space = generator.dumps()
Now we support the following pruners and quantizers:
.. code-block:: python
PRUNER_DICT = {
'level': LevelPruner,
'slim': SlimPruner,
'l1': L1FilterPruner,
'l2': L2FilterPruner,
'fpgm': FPGMPruner,
'taylorfo': TaylorFOWeightFilterPruner,
'apoz': ActivationAPoZRankFilterPruner,
'mean_activation': ActivationMeanRankFilterPruner
}
QUANTIZER_DICT = {
'naive': NaiveQuantizer,
'qat': QAT_Quantizer,
'dorefa': DoReFaQuantizer,
'bnn': BNNQuantizer
}
Provide user model for compression
----------------------------------
Users need to inherit ``AbstractAutoCompressionModule`` and override the abstract class function.
.. code-block:: python
from nni.algorithms.compression.pytorch.auto_compress import AbstractAutoCompressionModule
class AutoCompressionModule(AbstractAutoCompressionModule):
Users need to implement at least ``model()`` and ``evaluator()``.
If you use iterative pruner, you need to additional implement ``optimizer_factory()``, ``criterion()`` and ``sparsifying_trainer()``.
If you want to finetune the model after compression, you need to implement ``optimizer_factory()``, ``criterion()``, ``post_compress_finetuning_trainer()`` and ``post_compress_finetuning_epochs()``.
The ``optimizer_factory()`` should return a factory function, the input is an iterable variable, i.e. your ``model.parameters()``, and the output is an optimizer instance.
The two kinds of ``trainer()`` should return a trainer with input ``model, optimizer, criterion, current_epoch``.
The full abstract interface refers to :githublink:`interface.py <nni/algorithms/compression/pytorch/auto_compress/interface.py>`.
An example of ``AutoCompressionModule`` implementation refers to :githublink:`auto_compress_module.py <examples/model_compress/auto_compress/torch/auto_compress_module.py>`.
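As a rough sketch (assuming the abstract methods are classmethods that return the model and an evaluation callable; ``_evaluate`` is a placeholder for your own validation loop, and the exact abstract methods and signatures are defined in the interface linked above):

.. code-block:: python

   from typing import Callable

   import torch.nn as nn
   from torchvision.models import resnet18

   from nni.algorithms.compression.pytorch.auto_compress import AbstractAutoCompressionModule


   def _evaluate(model: nn.Module) -> float:
       # placeholder: run your validation loop here and return the metric (e.g. accuracy)
       ...


   class AutoCompressionModule(AbstractAutoCompressionModule):
       @classmethod
       def model(cls) -> nn.Module:
           # return the pre-trained model to be compressed
           return resnet18(pretrained=True)

       @classmethod
       def evaluator(cls) -> Callable[[nn.Module], float]:
           # return a function that takes the compressed model and returns a score
           return _evaluate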
Launch NNI experiment
---------------------
Launching is similar to launching an NNI experiment from Python; the differences are that there is no need to set ``trial_command``, and the user-provided ``AutoCompressionModule`` is passed as the ``AutoCompressionExperiment`` input.
.. code-block:: python
from pathlib import Path
from nni.algorithms.compression.pytorch.auto_compress import AutoCompressionExperiment
from auto_compress_module import AutoCompressionModule
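A sketch of configuring and launching the experiment is shown below; the configuration fields mirror a regular NNI experiment config, ``search_space`` is the object produced by the generator above, and the concrete values (concurrency, trial number, port) are placeholders:

.. code-block:: python

   from pathlib import Path

   from nni.algorithms.compression.pytorch.auto_compress import AutoCompressionExperiment

   from auto_compress_module import AutoCompressionModule

   # pass the user-defined module instead of a trial command
   experiment = AutoCompressionExperiment(AutoCompressionModule, 'local')
   experiment.config.experiment_name = 'auto compression example'
   experiment.config.trial_concurrency = 1
   experiment.config.max_trial_number = 10
   experiment.config.search_space = search_space          # generated by AutoCompressionSearchSpaceGenerator
   experiment.config.trial_code_directory = Path(__file__).parent
   experiment.config.tuner.name = 'TPE'
   experiment.config.tuner.class_args['optimize_mode'] = 'maximize'

   experiment.run(8080)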
In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface, which covers pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then demonstrate how to customize a new quantization algorithm.
**Important Note** To better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. Refer to :doc:`Framework overview of model compression <legacy_framework>`.
Customize a new pruning algorithm
---------------------------------
Implementing a new pruning algorithm requires implementing a ``weight masker`` class, which should be a subclass of ``WeightMasker``, and a ``pruner`` class, which should be a subclass of ``Pruner``.
An implementation of ``weight masker`` may look like this:
.. code-block:: python

   class MyMasker(WeightMasker):
       def __init__(self, model, pruner):
           super().__init__(model, pruner)
           # You can do some initialization here, such as collecting some statistics data
           # if it is necessary for your algorithms to calculate the masks.

       def calc_mask(self, sparsity, wrapper, wrapper_idx=None):
           # calculate the masks based on the wrapper.weight, and sparsity,
           # and anything else
           # mask = ...
           return {'weight_mask': mask}
You can refer to the NNI provided :githublink:`weight masker <nni/algorithms/compression/pytorch/pruning/structured_pruning_masker.py>` implementations to implement your own weight masker.
Refer to the NNI provided :githublink:`pruner <nni/algorithms/compression/pytorch/pruning/one_shot_pruner.py>` implementations to implement your own pruner class.
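An implementation of the corresponding ``pruner`` may look like the sketch below; it follows the typical one-shot pattern and uses the ``if_calculated`` buffer that is discussed later on this page (check the linked pruner implementations for the exact base-class signature):

.. code-block:: python

   class MyPruner(Pruner):
       def __init__(self, model, config_list, optimizer=None):
           super().__init__(model, config_list, optimizer)
           # register a buffer on every module wrapper as a "already pruned" flag
           self.set_wrappers_attribute("if_calculated", False)
           # construct a weight masker instance
           self.masker = MyMasker(model, self)

       def calc_mask(self, wrapper, wrapper_idx=None):
           sparsity = wrapper.config['sparsity']
           if wrapper.if_calculated:
               # this is a one-shot pruner, so do not calculate the mask again
               return None
           # call the masker to actually calculate the mask for this layer
           masks = self.masker.calc_mask(sparsity=sparsity, wrapper=wrapper, wrapper_idx=wrapper_idx)
           wrapper.if_calculated = True
           return masks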
----
Customize a new quantization algorithm
--------------------------------------
To write a new quantization algorithm, you can write a class that inherits ``nni.compression.pytorch.Quantizer``. Then, override the member functions with the logic of your algorithm. The member function to override is ``quantize_weight``. ``quantize_weight`` directly returns the quantized weights rather than a mask, because for quantization the quantized weights cannot be obtained by applying a mask.
.. code-block:: python

   from nni.compression.pytorch import Quantizer

   class YourQuantizer(Quantizer):
       def __init__(self, model, config_list):
           """
           Suggest you to use the NNI defined spec for config
           """
           super().__init__(model, config_list)

       def quantize_weight(self, weight, config, **kwargs):
           """
           quantize should overload this method to quantize weight.
           This method is effectively hooked to :meth:`forward` of the model.

           Parameters
           ----------
           weight : Tensor
               weight that needs to be quantized
           config : dict
               the configuration for weight quantization
           """
           # Put your code to generate `new_weight` here
           return new_weight

       def quantize_input(self, inputs, config, **kwargs):
           """
           quantize should overload this method to quantize input.
           This method is effectively hooked to :meth:`forward` of the model.

           Parameters
           ----------
           inputs : Tensor
               inputs that needs to be quantized
           config : dict
               the configuration for inputs quantization
           """
           # Put your code to generate `new_input` here
           return new_input

       def update_epoch(self, epoch_num):
           pass

       def step(self):
           """
           Can do some processing based on the model or weights binded
           in the func bind_model
           """
           pass
Customize backward function
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes it's necessary for a quantization operation to have a customized backward function, such as a `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__. Users can customize a backward function as follows:
.. code-block:: python
from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType
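A sketch of such a customized backward function is shown below. It assumes the ``QuantGrad.quant_backward`` static-method hook and the ``QuantType`` enumeration from the import above; the clipping threshold is illustrative only, and the exact member names may differ across NNI versions.

.. code-block:: python

   import torch
   from nni.compression.pytorch.compressor import Quantizer, QuantGrad, QuantType

   class ClipGrad(QuantGrad):
       @staticmethod
       def quant_backward(tensor, grad_output, quant_type):
           # Straight-Through Estimator: pass gradients through unchanged,
           # except that gradients of output quantization are zeroed where |tensor| > 1
           if quant_type == QuantType.QUANT_OUTPUT:
               grad_output[torch.abs(tensor) > 1] = 0
           return grad_output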
There are 3 major components/classes in the NNI model compression framework: ``Compressor``, ``Pruner`` and ``Quantizer``. Let's look at them in detail one by one:
Compressor
----------
Compressor is the base class for pruners and quantizers. It provides a unified interface for end users, so that pruners and quantizers can be used in the same way. For example, to use a pruner:
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
# load a pretrained model or train a model before using a pruner
configure_list = [{
'sparsity': 0.7,
'op_types': ['Conv2d', 'Linear'],
}]
pruner = LevelPruner(model, configure_list)
model = pruner.compress()
# model is ready for pruning, now start finetune the model,
# the model will be pruned during training automatically
To use a quantizer:
.. code-block:: python

   from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
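A rough usage sketch follows (assuming ``model`` is your ``torch.nn.Module``; the config values are illustrative, and DoReFa-style quantizers also take the training optimizer in recent NNI versions):

.. code-block:: python

   import torch
   from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer

   config_list = [{
       'quant_types': ['weight'],
       'quant_bits': 8,
       'op_types': ['default']
   }]
   optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
   quantizer = DoReFaQuantizer(model, config_list, optimizer)
   quantizer.compress()
   # the model is now wrapped with fake-quantization logic and can be trained as usual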
View :githublink:`example code <examples/model_compress>` for more information.
The ``Compressor`` class provides some utility methods for subclasses and users:
Set wrapper attribute
^^^^^^^^^^^^^^^^^^^^^
Sometimes ``calc_mask`` must save some state data, therefore users can use ``set_wrappers_attribute`` API to register attribute just like how buffers are registered in PyTorch modules. These buffers will be registered to ``module wrapper``. Users can access these buffers through ``module wrapper``.
In the above example, we use ``set_wrappers_attribute`` to set a buffer ``if_calculated``, which is used as a flag indicating whether the mask of a layer has already been calculated.
Collect data during forward
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes users want to collect some data during the modules' forward method, for example, the mean value of the activation. This can be done by adding a customized collector to the module.
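A minimal sketch of such a collector, using a plain PyTorch forward hook rather than any NNI-specific helper (``model.conv1`` is just an illustrative layer name):

.. code-block:: python

   import torch

   collected_activation = []

   def collector(module, module_input, module_output):
       # collect the mean value of the activation for later analysis
       collected_activation.append(torch.mean(module_output).item())

   # register the collector on the layer of interest
   handle = model.conv1.register_forward_hook(collector)
   # ... run some forward passes ...
   handle.remove()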
Pruner
------

A pruner receives ``model`` and ``config_list`` as arguments.
Some pruners like ``TaylorFOWeightFilterPruner`` prune the model per the ``config_list`` during the training loop by adding a hook on ``optimizer.step()``.

The ``Pruner`` class is a subclass of ``Compressor``, so it contains everything in the ``Compressor`` class and some additional components that are only for pruning. It contains:
Weight masker
^^^^^^^^^^^^^
A ``weight masker`` is the implementation of a pruning algorithm; it can prune a specified layer wrapped by ``module wrapper`` with a specified sparsity.
Pruning module wrapper
^^^^^^^^^^^^^^^^^^^^^^
A ``pruning module wrapper`` is a module containing:

#. the original module
#. some buffers used by ``calc_mask``
#. a new forward method that applies masks before running the original forward method.

The reasons to use ``module wrapper`` are:

#. some buffers are needed by ``calc_mask`` to calculate masks, and these buffers should be registered in ``module wrapper`` so that the original modules are not contaminated.
#. a new ``forward`` method is needed to apply masks to the weight before calling the real ``forward`` method.
Pruning hook
^^^^^^^^^^^^
A pruning hook is installed on a pruner when the pruner is constructed. It is used to call the pruner's ``calc_mask`` method when ``optimizer.step()`` is invoked.

Quantizer
---------

Quantization module wrapper
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each module/layer of the model to be quantized is wrapped by a quantization module wrapper, which provides a new ``forward`` method to quantize the original module's weight, input and output.
Quantization hook
^^^^^^^^^^^^^^^^^
A quantization hook is installed on a quantizer when it is constructed; it is called when ``optimizer.step()`` is invoked.
Quantization methods
^^^^^^^^^^^^^^^^^^^^
The ``Quantizer`` class provides the following methods for subclasses to implement quantization algorithms:
Multi-GPU support
-----------------

In multi-GPU training, buffers and parameters are copied to multiple GPUs every time the ``forward`` method runs. If buffers and parameters are updated in the ``forward`` method, an ``in-place`` update is needed to ensure the update is effective.
Since ``calc_mask`` is called in the ``optimizer.step`` method, which happens after the ``forward`` method and happens only on one GPU, it supports multi-GPU naturally.
Pruning is a common technique to compress neural network models.
The pruning methods explore the redundancy in the model weights (parameters) and try to remove/prune the redundant and uncritical weights.
The redundant elements are pruned from the model: their values are zeroed and we make sure they don't take part in the back-propagation process.

Pruning V2 is a refactoring of the old version and provides more powerful functions.
Compared with the old version, the iterative pruning process is detached from the pruner, and the pruner is only responsible for pruning and generating the masks once.
What's more, pruning V2 unifies the pruning process and provides a freer combination of pruning components.
The task generator only cares about the pruning effect that should be achieved in each round, and uses a config list to express how to prune in the next step.
The pruner is reset with the model and config list given by the task generator, and then generates the masks in the current step.
The following concepts can help you understand pruning in NNI.
.. Using rubric to prevent the section heading from being included into the toc
.. rubric:: Pruning Target
Pruning target means where we apply the sparsity.
Most pruning methods prune the weights to reduce the model size and the inference latency.
Other pruning methods also apply sparsity to the inputs, outputs or intermediate states to reduce the inference latency.
NNI supports pruning module weights right now, and will support other pruning targets in the future.
.. rubric:: Basic Pruner
A basic pruner generates the masks for each pruning target (weights) for a determined sparsity ratio.
It usually takes the model and config as input arguments, then generates the masks for the model.
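As a rough sketch of how a basic pruner is used (assuming ``model`` is a pre-trained ``torch.nn.Module``; the import path below follows the pruning V2 package layout and may differ slightly between NNI versions):

.. code-block:: python

   from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner

   # prune 50% of the output channels of all Conv2d weights
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
   pruner = L1NormPruner(model, config_list)
   # a basic pruner returns the wrapped model together with the generated masks
   masked_model, masks = pruner.compress()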
.. rubric:: Scheduled Pruner
Scheduled pruner decides how to allocate sparsity ratio to each pruning targets, it also handles the pruning speed up and finetuning logic.
From the implementation logic, the scheduled pruner is a combination of pruning scheduler, basic pruner and task generator.
Task generator only cares about the pruning effect that should be achieved in each round, and uses a config list to express how to pruning.
Basic pruner will reset with the model and config list given by task generator then generate the masks.
For a clearer view of the structure, please refer to the figure below.

.. image:: ../../img/pruning_process.png
   :target: ../../img/pruning_process.png
   :scale: 80%
   :align: center
   :alt:
A pruning process is usually driven by a pruning scheduler, which contains a specific pruner and a task generator.
Users can also use a pruner directly, as in pruning V1.
For more information about the scheduled pruning process, please refer to :doc:`Pruning Scheduler <pruning_scheduler>`.
.. rubric:: Granularity
Fine-grained pruning or unstructured pruning refers to pruning each individual weight separately.
Coarse-grained pruning or structured pruning prunes an entire group of weights, such as a convolutional filter.
:ref:`level-pruner` is the only fine-grained pruner in NNI; all other pruners prune output channels of weights.

.. _dependency-aware-mode-for-output-channel-pruning:

.. rubric:: Dependency-aware Mode for Output Channel Pruning

In these pruning algorithms, the pruner will prune each layer separately. While pruning a layer,
the algorithm will quantify the importance of each filter based on some specific rules (such as the L1 norm), and prune the less important output channels.
We use pruning convolutional layers as an example to explain ``dependency aware`` mode.
As the :doc:`dependency analysis utils <./compression_utils>` show, if the output channels of two convolutional layers (conv1, conv2) are added together,
then these two convolutional layers have a channel dependency with each other (for more details, please see :doc:`Compression Utils <./compression_utils>`).
Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
:target: ../../img/mask_conflict.jpg
:scale: 80%
:align: center
:alt:
Suppose we prune the first 50% of output channels (filters) for conv1, and the last 50% of output channels for conv2.
Although both layers have pruned 50% of the filters, the speedup module still needs to add zeros to align the output channels.
In this case, we cannot harvest the speed benefit from the model pruning.

To better gain the speed benefit of the model pruning, we add a dependency-aware mode for the pruners that can prune output channels.
In the dependency-aware mode, the pruner prunes the model not only based on the metric of each output channel, but also based on the topology of the whole network architecture.
In the dependency-aware mode (``dependency_aware`` is set to ``True``), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
:target: ../../img/dependency-aware.jpg
:scale: 80%
:align: center
:alt:
Take the dependency-aware mode of :ref:`l1-norm-pruner` as an example.
Specifically, the pruner will calculate, for each channel, the sum of the L1 norms (for example) across all the layers in the dependency set.
Obviously, the number of channels that can actually be pruned in this dependency set is determined by the minimum sparsity of the layers in this dependency set (denoted by ``min_sparsity``).
According to the L1 norm sum of each channel, the pruner will prune the same ``min_sparsity`` channels for all the layers.
Next, the pruner will additionally prune ``sparsity`` - ``min_sparsity`` channels for each convolutional layer based on its own L1 norm of each channel.
For example, suppose the output channels of ``conv1``, ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3, 0.2 respectively.
In this case, the ``dependency-aware pruner`` will
* First, prune the same 20% of channels for `conv1` and `conv2` according to L1 norm sum of `conv1` and `conv2`.
* Second, the pruner will additionally prune 10% channels for `conv1` according to the L1 norm of each channel of `conv1`.
.. note::
   In addition, for the convolutional layers that have more than one filter group,
   the ``dependency-aware pruner`` will also try to prune the same number of channels for each filter group.

Overall, this pruner will prune the model according to the L1 norm of each filter and try to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner will provide a better speed gain from the model pruning.
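As a usage sketch, enabling the dependency-aware mode typically only requires passing ``dependency_aware=True`` together with a ``dummy_input`` used to trace the channel dependencies (assuming ``model`` is your network; the exact keyword arguments may vary slightly between pruners):

.. code-block:: python

   import torch
   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner

   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
   # dummy_input is needed to trace the topology and find the channel dependencies
   pruner = L1FilterPruner(model, config_list, dependency_aware=True,
                           dummy_input=torch.rand(1, 3, 224, 224))
   pruner.compress()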
We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Fine-grained Pruning** generally results in unstructured models, which need specialized hardware or software to speed up the sparse network. **Filter Pruning** achieves acceleration by removing the entire filter. Some pruning algorithms use a one-shot method that prunes weights at once based on an importance metric (it is then necessary to finetune the model to compensate for the loss of accuracy). Other pruning algorithms prune weights **iteratively** during optimization, which controls the pruning schedule; these include some automatic pruning algorithms.
* `Transformer Head Pruner <#transformer-head-pruner>`__
Level Pruner
------------
This is a basic one-shot pruner: you can set a target sparsity level (expressed as a fraction, 0.6 means we will prune 60% of the weight parameters).
We first sort the weights in the specified layer by their absolute values. Then we mask to zero the smallest-magnitude weights until the desired sparsity level is reached.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LevelPruner
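A usage sketch (assuming ``model`` is a pre-trained ``torch.nn.Module``):

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner

   config_list = [{'sparsity': 0.8, 'op_types': ['default']}]
   pruner = LevelPruner(model, config_list)
   pruner.compress()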
Slim Pruner
-----------

This is a one-shot pruner, which adds sparsity regularization on the scaling factors of batch normalization (BN) layers during training to identify unimportant channels. The channels with small scaling factor values will be pruned. For more details, please refer to `'Learning Efficient Convolutional Networks through Network Slimming' <https://arxiv.org/pdf/1708.06519.pdf>`__\.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import SlimPruner
We implemented one of the experiments in `Learning Efficient Convolutional Networks through Network Slimming <https://arxiv.org/pdf/1708.06519.pdf>`__: we pruned ``70%`` of the channels in the **VGGNet** for CIFAR-10 as in the paper, in which ``88.5%`` of the parameters are pruned. Our experiment results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Error(paper/ours)
- Parameters
- Pruned
* - VGGNet
- 6.34/6.69
- 20.04M
-
* - Pruned-VGGNet
- 6.20/6.34
- 2.03M
- 88.5%
The experiments code can be found at :githublink:`examples/model_compress/pruning/basic_pruners_torch.py <examples/model_compress/pruning/basic_pruners_torch.py>`
FPGM Pruner
-----------

This is a one-shot pruner, which prunes the filters with the smallest geometric median. FPGM chooses the filters with the most replaceable contribution.
For more details, please refer to `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/pdf/1811.00250.pdf>`__.

We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-aware-mode-for-output-channel-pruning` for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import FPGMPruner
L1Filter Pruner
---------------

This is a one-shot pruner, which prunes the filters in the **convolution layers**.
..
The procedure of pruning m filters from the ith convolutional layer is as follows:
#. For each filter :math:`F_{i,j}`, calculate the sum of its absolute kernel weights :math:`s_j=\sum_{l=1}^{n_i}\sum|K_l|`.
#. Sort the filters by :math:`s_j`.
#. Prune :math:`m` filters with the smallest sum values and their corresponding feature maps. The
kernels in the next convolutional layer corresponding to the pruned feature maps are also removed.
#. A new kernel matrix is created for both the :math:`i`-th and :math:`i+1`-th layers, and the remaining kernel
weights are copied to the new model.
For more details, please refer to `PRUNING FILTERS FOR EFFICIENT CONVNETS <https://arxiv.org/abs/1608.08710>`__\.
In addition, we also provide a dependency-aware mode for the L1FilterPruner. For more details about the dependency-aware mode, please refer to :ref:`dependency-aware-mode-for-output-channel-pruning`.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
We implemented one of the experiments in `PRUNING FILTERS FOR EFFICIENT CONVNETS <https://arxiv.org/abs/1608.08710>`__ with **L1FilterPruner**: we pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** as in the paper, in which ``64%`` of the parameters are pruned. Our experiment results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Error(paper/ours)
- Parameters
- Pruned
* - VGG-16
- 6.75/6.49
- 1.5x10^7
-
* - VGG-16-pruned-A
- 6.60/6.47
- 5.4x10^6
- 64.0%
The experiments code can be found at :githublink:`examples/model_compress/pruning/basic_pruners_torch.py <examples/model_compress/pruning/basic_pruners_torch.py>`
L2Filter Pruner
---------------

This is a structured pruning algorithm that prunes the filters with the smallest L2 norm of the weights. It is implemented as a one-shot pruner.

We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-aware-mode-for-output-channel-pruning` for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import L2FilterPruner
ActivationAPoZRankFilter Pruner
-------------------------------

ActivationAPoZRankFilter Pruner is a pruner which prunes the filters with the smallest importance criterion ``APoZ`` calculated from the output activations of convolution layers to achieve a preset level of network sparsity. The pruning criterion ``APoZ`` is explained in the paper `Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures <https://arxiv.org/abs/1607.03250>`__.

We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-aware-mode-for-output-channel-pruning` for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import ActivationAPoZRankFilterPruner
Note: ActivationAPoZRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
You can view :githublink:`example <examples/model_compress/pruning/basic_pruners_torch.py>` for more information.
User configuration for ActivationAPoZRankFilter Pruner
ActivationMeanRankFilter Pruner
-------------------------------

ActivationMeanRankFilterPruner is a pruner which prunes the filters with the smallest importance criterion ``mean activation`` calculated from the output activations of convolution layers to achieve a preset level of network sparsity. The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference <https://arxiv.org/abs/1611.06440>`__. Other pruning criteria mentioned in this paper will be supported in a future release.

We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-aware-mode-for-output-channel-pruning` for more details.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import ActivationMeanRankFilterPruner
Note: ActivationMeanRankFilterPruner is used to prune convolutional layers within deep neural networks, therefore the ``op_types`` field supports only convolutional layers.
You can view :githublink:`example <examples/model_compress/pruning/basic_pruners_torch.py>` for more information.
User configuration for ActivationMeanRankFilterPruner
TaylorFOWeightFilter Pruner
---------------------------

TaylorFOWeightFilter Pruner is a pruner which prunes convolutional layers based on estimated importance calculated from the first-order Taylor expansion on weights to achieve a preset level of network sparsity. The estimated importance of filters is defined as in the paper `Importance Estimation for Neural Network Pruning <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__. Other pruning criteria mentioned in this paper will be supported in a future release.

We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-aware-mode-for-output-channel-pruning` for more details.
What's more, we provide a global-sort mode for this pruner, which is aligned with the paper's implementation. Please set the parameter ``global_sort`` to ``True`` when instantiating TaylorFOWeightFilterPruner.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import TaylorFOWeightFilterPruner
AGP Pruner
----------

This is an iterative pruner, in which the sparsity is increased from an initial sparsity value :math:`s_i` (usually 0) to a final sparsity value :math:`s_f` over a span of :math:`n` pruning steps, starting at training step :math:`t_{0}` and with pruning frequency :math:`\Delta t`:
:math:`s_{t}=s_{f}+\left(s_{i}-s_{f}\right)\left(1-\frac{t-t_{0}}{n \Delta t}\right)^{3} \text { for } t \in\left\{t_{0}, t_{0}+\Delta t, \ldots, t_{0} + n \Delta t\right\}`
For more details please refer to `To prune, or not to prune: exploring the efficacy of pruning for model compression <https://arxiv.org/abs/1710.01878>`__\.
Usage
^^^^^
You can prune all weights from 0% to 80% sparsity in 10 epochs with the code below.
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import AGPPruner
config_list = [{
'sparsity': 0.8,
'op_types': ['default']
}]
# load a pretrained model or train a model before using a pruner
AGP pruner uses the ``LevelPruner`` algorithm to prune the weights by default; however, you can set the ``pruning_algorithm`` parameter to other values (e.g. ``slim``, ``l1``, ``l2``, ``fpgm``, ``taylorfo``, ``apoz``, ``mean_activation``) to use other pruning algorithms.
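A sketch of constructing the AGP pruner with the ``config_list`` defined above is shown below; ``trainer`` and ``criterion`` are placeholders for your own training function and loss, and the exact signature may vary between NNI versions:

.. code-block:: python

   import torch
   from nni.algorithms.compression.pytorch.pruning import AGPPruner

   optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
   pruner = AGPPruner(model,
                      config_list,
                      optimizer,
                      trainer,
                      criterion,
                      num_iterations=10,
                      epochs_per_iteration=1,
                      pruning_algorithm='level')
   pruner.compress()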
SimulatedAnnealing Pruner
-------------------------

We implement a guided heuristic search method, the Simulated Annealing (SA) algorithm, with an enhancement on guided search based on prior experience.
The enhanced SA technique is based on the observation that a DNN layer with a larger number of weights often tolerates a higher degree of compression with less impact on overall accuracy.
* Randomly initialize a pruning rate distribution (sparsities).
* While current_temperature < stop_temperature:

  #. Generate a perturbation to the current distribution
  #. Perform a fast evaluation on the perturbed distribution
  #. Accept the perturbation according to the performance and probability; if not accepted, return to step 1
For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import SimulatedAnnealingPruner
AutoCompress Pruner
-------------------

For each round, AutoCompressPruner prunes the model with the same sparsity to achieve the overall sparsity:
.. code-block:: bash
1. Generate sparsities distribution using SimulatedAnnealingPruner
2. Perform ADMM-based structured pruning to generate pruning result for the next round.
Here we use `speedup` to perform real pruning.
For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import AutoCompressPruner
AMC Pruner
----------

We implemented one of the experiments in `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__: we pruned **MobileNet** to 50% FLOPS for ImageNet as in the paper. Our experiment results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Top 1 acc.(paper/ours)
- Top 5 acc. (paper/ours)
- FLOPS
* - MobileNet
- 70.5% / 69.9%
- 89.3% / 89.1%
- 50%
The experiments code can be found at :githublink:`examples/model_compress/pruning/ <examples/model_compress/pruning/amc/>`
ADMM Pruner
-----------
Alternating Direction Method of Multipliers (ADMM) is a mathematical optimization technique
that decomposes the original nonconvex problem into two subproblems that can be solved iteratively. In the weight pruning problem, these two subproblems are solved via 1) a gradient descent algorithm and 2) Euclidean projection, respectively.

During the process of solving these two subproblems, the weights of the original model will be changed. A one-shot pruner will then be applied to prune the model according to the given config list.
This solution framework applies both to non-structured and different variations of structured pruning schemes.
For more details, please refer to `A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers <https://arxiv.org/abs/1804.03294>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import ADMMPruner
Lottery Ticket Hypothesis
-------------------------

`The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks <https://arxiv.org/abs/1803.03635>`__, by Jonathan Frankle and Michael Carbin, provides comprehensive measurement and analysis, and articulates the *lottery ticket hypothesis*: dense, randomly-initialized, feed-forward networks contain subnetworks (*winning tickets*) that -- when trained in isolation -- reach test accuracy comparable to the original network in a similar number of iterations.

In this paper, the authors use the following process to prune a model, called *iterative pruning*:
#. Randomly initialize a neural network f(x;theta_0) (where theta_0 follows D_theta).
#. Train the network for j iterations, arriving at parameters theta_j.
#. Prune p% of the parameters in theta_j, creating a mask m.
#. Reset the remaining parameters to their values in theta_0, creating the winning ticket f(x;m*theta_0).
#. Repeat steps 2, 3, and 4.
If the configured final sparsity is P (e.g., 0.8) and there are n times iterative pruning, each iterative pruning prunes 1-(1-P)^(1/n) of the weights that survive the previous round.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import LotteryTicketPruner
The above configuration means that there are 5 iterations of pruning. As the 5 pruning iterations are executed in the same run, LotteryTicketPruner needs ``model`` and ``optimizer`` (**note that ``lr_scheduler`` should also be added if used**) to reset their states every time a new prune iteration starts. Please use ``get_prune_iterations`` to get the pruning iterations, and invoke ``prune_iteration_start`` at the beginning of each iteration. ``epoch_num`` should be large enough for model convergence, because the hypothesis is that the performance (accuracy) obtained in later rounds with high sparsity can be comparable with that obtained in the first round.
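A sketch of the iterative loop described above (``config_list``, ``optimizer``, ``criterion``, ``train_model`` and ``epoch_num`` are placeholders for your own configuration and training routine):

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LotteryTicketPruner

   pruner = LotteryTicketPruner(model, config_list, optimizer)
   pruner.compress()

   for i in pruner.get_prune_iterations():
       pruner.prune_iteration_start()
       for epoch in range(epoch_num):
           # train_model is a placeholder for your own training loop
           train_model(model, optimizer, criterion, epoch)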
We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be referred :githublink:`here <examples/model_compress/pruning/lottery_torch_mnist_fc.py>`. In this experiment, we prune 10 times, for each pruning we train the pruned model for 50 epochs.
.. image:: ../../img/lottery_ticket_mnist_fc.png
:target: ../../img/lottery_ticket_mnist_fc.png
:alt:
The above figure shows the result of the fully connected network. ``round0-sparsity-0.0`` is the performance without pruning. Consistent with the paper, pruning around 80% also obtains similar performance compared to non-pruning, and converges a little faster. If pruning too much, e.g., more than 94%, the accuracy becomes lower and convergence becomes a little slower. Slightly differently from the paper, the trend of the data in the paper is relatively clearer.
Sensitivity Pruner
------------------
For each round, SensitivityPruner prunes the model based on each layer's sensitivity to accuracy, until the final configured sparsity of the whole model is met:
.. code-block:: bash
1. Analyze the sensitivity of each layer in the current state of the model.
2. Prune each layer according to the sensitivity.
For more details, please refer to `Learning both Weights and Connections for Efficient Neural Networks <https://arxiv.org/abs/1506.02626>`__.
Usage
^^^^^
PyTorch code
.. code-block:: python
from nni.algorithms.compression.pytorch.pruning import SensitivityPruner
Transformer Head Pruner
-----------------------

Transformer Head Pruner is a tool designed for pruning attention heads from models belonging to the `Transformer family <https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf>`__. The following image from `Efficient Transformers: A Survey <https://arxiv.org/pdf/2009.06732.pdf>`__ gives a good overview of the general structure of the Transformer.
.. image:: ../../img/transformer_structure.png
:target: ../../img/transformer_structure.png
:alt:
Typically, each attention layer in the Transformer models consists of four weights: three projection matrices for query, key, value, and an output projection matrix. The outputs of the former three matrices contain the projected results for all heads. Normally, the results are then reshaped so that each head performs the attention computation independently. The final results are concatenated back before being fed into the output projection. Therefore, when an attention head is pruned, the weights corresponding to that head in the three projection matrices are pruned, as are the weights in the output projection corresponding to the head's output. In our implementation, we calculate and apply masks to the four matrices together.

Note: currently, the pruner can only handle models with projection weights written as separate ``Linear`` modules, i.e., it expects four ``Linear`` modules corresponding to the query, key, value, and output projections. Therefore, in the ``config_list``, you should either write ``['Linear']`` for the ``op_types`` field, or write names corresponding to ``Linear`` modules for the ``op_names`` field. For instance, the `Huggingface transformers <https://huggingface.co/transformers/index.html>`_ are supported, but ``torch.nn.Transformer`` is not.
The pruner implements the following algorithm:
.. code-block:: bash
Repeat for each pruning iteration (1 for one-shot pruning):
1. Calculate importance scores for each head in each specified layer using a specific criterion.
2. Sort heads locally or globally, and prune out some heads with lowest scores. The number of pruned heads is determined according to the sparsity specified in the config.
3. If the specified pruning iteration is larger than 1 (iterative pruning), finetune the model for a while before the next pruning iteration.
Currently, the following head sorting criteria are supported:
* "l1_weight": rank heads by the L1-norm of weights of the query, key, and value projection matrices.
* "l2_weight": rank heads by the L2-norm of weights of the query, key, and value projection matrices.
* "l1_activation": rank heads by the L1-norm of their attention computation output.
* "l2_activation": rank heads by the L2-norm of their attention computation output.
* "taylorfo": rank heads by l1 norm of the output of attention computation * gradient for this output. Check more details in `this paper <https://arxiv.org/abs/1905.10650>`__ and `this one <https://arxiv.org/abs/1611.06440>`__.
We support local sorting (i.e., sorting heads within a layer) and global sorting (sorting all heads together), and you can control it by setting the ``global_sort`` parameter. Note that if ``global_sort=True`` is passed, all weights must have the same sparsity in the config list. However, this does not mean that each layer will be pruned to the same sparsity as specified. This sparsity value will be interpreted as a global sparsity, and each layer is likely to have a different sparsity after pruning by global sort. As a reminder, we found that if global sorting is used, it is usually helpful to use an iterative pruning scheme, interleaving pruning with intermediate finetuning, since global sorting often results in non-uniform sparsity distributions, which makes the model more susceptible to forgetting.
In our implementation, we support two ways to group the four weights in the same layer together. You can either pass a nested list containing the names of these modules as the pruner's initialization parameters (usage below), or simply pass a dummy input instead and the pruner will run ``torch.jit.trace`` to group the weights (experimental feature). However, if you would like to assign different sparsity to each layer, you can only use the first option, i.e., passing names of the weights to the pruner (see usage below). Also, note that we require the weights belonging to the same layer to have the same sparsity.
Usage
^^^^^
Suppose we want to prune a BERT with the Huggingface implementation, which has the following architecture (obtained by calling ``print(model)``). Note that we only show the first layer of the repeated layers in the encoder's ``ModuleList``.
**Usage Example: one-shot pruning, assigning sparsity 0.5 to the first six layers and sparsity 0.25 to the last six layers (PyTorch code)**. Note that
* Here we specify ``op_names`` in the config list to assign different sparsity to different layers.
* Meanwhile, we pass ``attention_name_groups`` to the pruner so that the pruner may group together the weights belonging to the same attention layer.
* Since in this example we want to do one-shot pruning, the ``num_iterations`` parameter is set to 1, and the parameter ``epochs_per_iteration`` is ignored. If you would like to do iterative pruning instead, you can set the ``num_iterations`` parameter to the number of pruning iterations, and the ``epochs_per_iteration`` parameter to the number of finetuning epochs between two iterations.
* The arguments ``trainer`` and ``optimizer`` are only used when we want to do iterative pruning, or the ranking criterion is ``taylorfo``. Here these two parameters are ignored by the pruner.
* The argument ``forward_runner`` is only used when the ranking criterion is ``l1_activation`` or ``l2_activation``. Here this parameter is ignored by the pruner.
.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import TransformerHeadPruner

   attention_name_groups = list(zip(["encoder.layer.{}.attention.self.query".format(i) for i in range(12)],
                                    ["encoder.layer.{}.attention.self.key".format(i) for i in range(12)],
                                    ["encoder.layer.{}.attention.self.value".format(i) for i in range(12)],
                                    ["encoder.layer.{}.attention.output.dense".format(i) for i in range(12)]))

   kwargs = {"ranking_criterion": "l1_weight",
             "global_sort": False,
             "num_iterations": 1,
             "epochs_per_iteration": 1,  # this is ignored when num_iterations = 1
             "head_hidden_dim": 64,
             "attention_name_groups": attention_name_groups,
             "trainer": trainer,
             "optimizer": optimizer,
             "forward_runner": forward_runner
             }

   config_list = [{
       "sparsity": 0.5,
       "op_types": ["Linear"],
       "op_names": [x for layer in attention_name_groups[:6] for x in layer]    # first six layers
   }, {
       "sparsity": 0.25,
       "op_types": ["Linear"],
       "op_names": [x for layer in attention_name_groups[6:] for x in layer]    # last six layers
   }]

   pruner = TransformerHeadPruner(model, config_list, **kwargs)
   pruner.compress()
In addition to this usage guide, we provide a more detailed example of pruning BERT (Huggingface implementation) for transfer learning on tasks from the `GLUE benchmark <https://gluebenchmark.com/>`_. Please find it on this :githublink:`page <examples/model_compress/pruning/transformers>`. To run the example, first make sure that you install the packages ``transformers`` and ``datasets``. Then, you may start by running the following command:
.. code-block:: bash
./run.sh gpu_id glue_task
By default, the code will download a pretrained BERT language model and then finetune it for several epochs on the downstream GLUE task. Then, the ``TransformerHeadPruner`` will be used to prune out heads from each layer by a certain criterion (by default, the code lets the pruner use magnitude ranking and prune out 50% of the heads in each layer in a one-shot manner). Finally, the pruned model will be finetuned on the downstream task for several epochs. You can check the details of pruning from the logs printed out by the example. You can also experiment with different pruning settings by changing the parameters in ``run.sh``, or by directly changing the ``config_list`` in ``transformer_pruning.py``.