In order to simplify the process of writing new compression algorithms, we have designed a simple and flexible programming interface that covers both pruning and quantization. Below, we first demonstrate how to customize a new pruning algorithm and then how to customize a new quantization algorithm.
**Important Note**: to better understand how to customize new pruning/quantization algorithms, users should first understand the framework that supports various pruning algorithms in NNI. Refer to :doc:`Framework overview of model compression <legacy_framework>`.
Customize a new pruning algorithm
---------------------------------
...
Customize backward function
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sometimes it's necessary for a quantization operation to have a customized backward function, such as the `Straight-Through Estimator <https://stackoverflow.com/questions/38361314/the-concept-of-straight-through-estimator-ste>`__. Users can customize a backward function as follows:
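For illustration, here is a minimal straight-through estimator written as a plain ``torch.autograd.Function``. This is a generic PyTorch sketch of the idea rather than NNI's built-in quantization interface:

.. code-block:: python

   import torch


   class RoundSTE(torch.autograd.Function):
       """Round in the forward pass; pass the gradient straight through in the backward pass."""

       @staticmethod
       def forward(ctx, x):
           # Non-differentiable quantization step (simple rounding here).
           return torch.round(x)

       @staticmethod
       def backward(ctx, grad_output):
           # Straight-through estimator: treat the forward op as the identity,
           # so the incoming gradient is returned unchanged.
           return grad_output


   x = torch.randn(4, requires_grad=True)
   y = RoundSTE.apply(x).sum()
   y.backward()
   print(x.grad)  # tensor of ones: gradients flow through the rounding op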
Deep neural networks (DNNs) have achieved great success in many tasks.
However, typical neural networks are both computationally expensive and energy-intensive,
and can be difficult to deploy on devices with limited computation resources or strict latency requirements.
Therefore, a natural thought is to perform model compression to reduce the model size and accelerate model training/inference without significantly losing performance.
Model compression techniques can be divided into two categories: pruning and quantization.
Pruning methods explore the redundancy in the model weights and try to remove/prune the redundant and uncritical weights.
Quantization refers to compressing models by reducing the number of bits required to represent weights or activations.
As larger neural networks with more layers and nodes are considered, reducing their storage and computational cost becomes critical, especially for real-time applications. Model compression can be used to address this problem.
NNI provides an easy-to-use model compression toolkit to help users design and use pruning and quantization algorithms, and to compress and speed up their models with state-of-the-art algorithms and strategies.
To compress a model, users only need to add several lines to their code.
Some popular model compression algorithms are built into NNI.
Users can further use NNI's auto-tuning power to find the best compressed model, which is detailed in Auto Model Compression.
On the other hand, users can easily customize new compression algorithms using NNI's interface.
There are several core features supported by NNI model compression:
* Support many popular pruning and quantization algorithms.
* Automate the model pruning and quantization process with state-of-the-art strategies and NNI's auto tuning power.
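As a rough illustration of how few lines are needed, the sketch below applies a built-in pruner to a PyTorch model. It assumes the v1-style compression API (``L1FilterPruner`` under ``nni.algorithms.compression.pytorch.pruning``); please check the installed NNI version for the exact import path and signatures.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed v1-style import path; verify against your NNI version.
   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner

   model = resnet18()

   # Prune 50% of the output channels of every Conv2d layer.
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

   pruner = L1FilterPruner(model, config_list)
   pruner.compress()  # wraps the target layers and applies masks

   # Export the masked weights and the masks for later speedup.
   pruner.export_model(model_path='pruned_resnet18.pth', mask_path='mask_resnet18.pth')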
...

Compression Pipeline
--------------------
The overall compression pipeline in NNI. For compressing a pretrained model, pruning and quantization can be used alone or in combination.
.. note::
   NNI compression algorithms do not truly shrink the model by themselves; the NNI speedup tool is what really compresses the model and reduces latency.
   To obtain a truly compact model, users should conduct :doc:`model speedup <../tutorials/pruning_speed_up>`.
   The interface and APIs are unified for both PyTorch and TensorFlow; currently only the PyTorch version is supported, and the TensorFlow version will be supported in the future.
Supported Algorithms
--------------------
...

Pruning Algorithms
^^^^^^^^^^^^^^^^^^
.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Name
     - Brief Introduction of Algorithm
   * - :ref:`level-pruner`
     - Pruning the specified ratio on each weight based on absolute values of weights
   * - :ref:`l1-norm-pruner`
     - Pruning output channels with the smallest L1 norm of weights (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
   * - L2 Norm Pruner
     - Pruning output channels with the smallest L2 norm of weights
   * - Slim Pruner
     - Pruning output channels by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) `Reference Paper <https://arxiv.org/abs/1708.06519>`__
   * - Activation APoZ Rank Pruner
     - Pruning output channels based on the metric APoZ (average percentage of zeros), which measures the percentage of zeros in the activations of (convolutional) layers. `Reference Paper <https://arxiv.org/abs/1607.03250>`__
   * - Activation Mean Rank Pruner
     - Pruning output channels based on the metric that calculates the smallest mean value of output activations
   * - Taylor FO Weight Pruner
     - Pruning filters based on the first-order Taylor expansion of the weights (Importance Estimation for Neural Network Pruning) `Reference Paper <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__
   * - AGP Pruner
     - Automated gradual pruning (To prune, or not to prune: exploring the efficacy of pruning for model compression) `Reference Paper <https://arxiv.org/abs/1710.01878>`__
   * - :ref:`lottery-ticket-pruner`
     - The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. `Reference Paper <https://arxiv.org/abs/1803.03635>`__
   * - :ref:`simulated-annealing-pruner`
     - Automatic pruning with a guided heuristic search method, the Simulated Annealing algorithm `Reference Paper <https://arxiv.org/abs/1907.03141>`__
   * - AMC Pruner
     - AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/abs/1802.03494>`__
   * - Transformer Head Pruner
     - Pruning attention heads from transformer models either in one shot or iteratively.
   * - :ref:`movement-pruner`
     - Movement Pruning: Adaptive Sparsity by Fine-Tuning `Reference Paper <https://arxiv.org/abs/2005.07683>`__

You can refer to this `benchmark <../CommunitySharings/ModelCompressionComparison.rst>`__ for the performance of these pruners on some benchmark problems.
Quantization Algorithms
^^^^^^^^^^^^^^^^^^^^^^^
...
.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Name
     - Brief Introduction of Algorithm
   * - QAT Quantizer
     - Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. `Reference Paper <http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf>`__
   * - BNN Quantizer
     - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. `Reference Paper <https://arxiv.org/abs/1602.02830>`__
   * - Observer Quantizer
     - Post training quantization. Collect quantization information during calibration with observers.

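For example, a quantizer is applied in much the same way as a pruner. The sketch below assumes the v1-style ``QAT_Quantizer`` API and a user-provided training loop; check the installed NNI version for exact signatures.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed v1-style import path; verify against your NNI version.
   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

   model = resnet18()
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

   # Quantize the weights and outputs of all Conv2d layers to 8 bits.
   config_list = [{
       'quant_types': ['weight', 'output'],
       'quant_bits': {'weight': 8, 'output': 8},
       'op_types': ['Conv2d'],
   }]

   quantizer = QAT_Quantizer(model, config_list, optimizer)
   quantizer.compress()

   # ... run the usual quantization-aware training loop on `model` here ...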
Model Speedup
-------------
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
For example, pruning algorithms use masks to simulate sparsity, and quantization algorithms still store quantized values in float32.
Given the output masks and quantization bits produced by those algorithms, NNI can truly speed up the model.
The detailed tutorial of Speed Up Model with Mask can be found :doc:`here <../tutorials/pruning_speed_up>`.
The detailed tutorial of Speed Up Model with Calibration Config can be found :doc:`here <../tutorials/quantization_speed_up>`.
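A minimal sketch of the speedup step for a pruned model is shown below. It assumes the masks were exported by a pruner (as in ``pruner.export_model``) and that ``ModelSpeedup`` lives under ``nni.compression.pytorch``; verify both against the installed NNI version.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed import path; verify against your NNI version.
   from nni.compression.pytorch import ModelSpeedup

   model = resnet18()
   dummy_input = torch.rand(1, 3, 224, 224)

   # 'mask_resnet18.pth' is the mask file exported by the pruner.
   ModelSpeedup(model, dummy_input, 'mask_resnet18.pth').speedup_model()

   # The masked channels are now physically removed, so the model is smaller and faster.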
Compression Utilities
---------------------

Compression utilities include some useful tools for users to understand and analyze the model they want to compress. For example, users can check the sensitivity of each layer to pruning, and easily calculate the FLOPs and parameter size of a model. Please refer to `here <./CompressionUtils.rst>`__ for a complete list of compression utilities.
Advanced Usage
--------------
NNI model compression provides a simple interface for users to customize new compression algorithms. The design philosophy of the interface is to let users focus on the compression logic while hiding framework-specific implementation details. Users can learn more about our compression framework and customize new compression algorithms (pruning or quantization) based on it. Moreover, users can leverage NNI's auto-tuning power to automatically compress a model. Please refer to `here <./advanced.rst>`__ for more details.
Reference and Feedback
----------------------
* To `report a bug <https://github.com/microsoft/nni/issues/new?template=bug-report.rst>`__ for this feature on GitHub;
* To `file a feature or improvement request <https://github.com/microsoft/nni/issues/new?template=enhancement.rst>`__ for this feature on GitHub;
* To know more about `Feature Engineering with NNI <../FeatureEngineering/Overview.rst>`__;
* To know more about `NAS with NNI <../NAS/Overview.rst>`__;
* To know more about `Hyperparameter Tuning with NNI <../Tuner/BuiltinTuner.rst>`__.
Pruning V2 is a refactoring of the old version and provides more powerful functions.
Compared with the old version, the iterative pruning process is detached from the pruner, and the pruner is only responsible for pruning and generating the masks once.
For a clearer view of the structure, please refer to the figure below.
.. image:: ../../img/pruning_process.png
   :target: ../../img/pruning_process.png
   :alt:
A pruning process is usually driven by a pruning scheduler, which contains a specific pruner and a task generator.
For details, please refer to the following tutorials:

.. note:: Users can also use a pruner directly, as in pruning V1.
In these pruning algorithms, the pruner prunes each layer separately. While pruning a layer,
the algorithm quantifies the importance of each filter based on a specific rule (such as the L1 norm) and prunes the less important output channels.
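For intuition, the kind of per-filter importance metric mentioned above can be computed with plain PyTorch; this is only an illustration of the metric, not NNI's internal implementation:

.. code-block:: python

   import torch

   conv = torch.nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3)

   # L1 norm of each filter (output channel): sum of absolute weight values.
   l1_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # shape: [32]

   # Select the 50% of output channels with the smallest L1 norm as pruning candidates.
   num_prune = conv.out_channels // 2
   channels_to_prune = torch.argsort(l1_norms)[:num_prune]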
We use pruning convolutional layers as an example to explain the ``dependency aware`` mode.
As the :doc:`dependency analysis utils <./compression_utils>` show, if the output channels of two convolutional layers (conv1, conv2) are added together,
then these two convolutional layers have a channel dependency with each other (for more details, please see :doc:`Compression Utils <./compression_utils>`).
Take the following figure as an example.
.. image:: ../../img/mask_conflict.jpg
   :target: ../../img/mask_conflict.jpg
   :alt:
Suppose we prune the first 50% of output channels (filters) of conv1 and the last 50% of output channels of conv2.
Although both layers have 50% of their filters pruned, the speedup module still needs to add zeros to align the output channels.
In this case, we cannot harvest the speed benefit from the model pruning.
To better gain the speed benefit of model pruning, we add a dependency-aware mode for the ``Pruner`` classes that prune output channels.
In the dependency-aware mode, the pruner prunes the model based not only on the metric of each output channel, but also on the topology of the whole network architecture.
In the dependency-aware mode (``dependency_aware`` is set to ``True``), the pruner will try to prune the same output channels for the layers that have channel dependencies with each other, as shown in the following figure.
.. image:: ../../img/dependency-aware.jpg
   :target: ../../img/dependency-aware.jpg
   :alt:
Take the dependency-aware mode of :ref:`l1-norm-pruner` as an example.
Specifically, for each channel the pruner calculates the sum of the L1 norms (in this example) of that channel across all the layers in the dependency set.
The number of channels that can actually be pruned for the whole dependency set is determined by the minimum sparsity among the layers in this set (denoted by ``min_sparsity``).
According to the summed L1 norm of each channel, the pruner prunes the same ``min_sparsity`` proportion of channels for all the layers.
Next, the pruner additionally prunes ``sparsity - min_sparsity`` of the channels of each convolutional layer based on its own per-channel L1 norms.
For example, suppose the output channels of ``conv1`` and ``conv2`` are added together and the configured sparsities of ``conv1`` and ``conv2`` are 0.3 and 0.2 respectively.
In this case, the dependency-aware pruner will:

* first prune the same 20% of channels for ``conv1`` and ``conv2`` according to the summed L1 norms of ``conv1`` and ``conv2``;
* then additionally prune 10% of the channels of ``conv1`` according to the per-channel L1 norms of ``conv1``.
In addition, for convolutional layers that have more than one filter group,
the dependency-aware pruner will also try to prune the same number of channels in each filter group.
Overall, this pruner prunes the model according to the L1 norm of each filter while trying to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In the dependency-aware mode, the pruner therefore yields a better speed gain from model pruning.
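A minimal sketch of enabling the dependency-aware mode is shown below. It assumes the v1-style pruner constructor accepts ``dependency_aware`` and ``dummy_input`` arguments, as the description above suggests; check the installed NNI version for the exact signature.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed v1-style import path; verify against your NNI version.
   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner

   # resnet18 contains residual additions, so several convolution layers
   # share channel dependencies with each other.
   model = resnet18()
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

   # `dummy_input` lets the pruner trace the graph and discover which
   # layers depend on each other before deciding which channels to prune.
   pruner = L1FilterPruner(model, config_list,
                           dependency_aware=True,
                           dummy_input=torch.rand(1, 3, 224, 224))
   pruner.compress()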
This is a one-shot pruner that prunes filters with the smallest geometric median. FPGM chooses the filters with the most replaceable contribution.
For more details, please refer to `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/pdf/1811.00250.pdf>`__.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-awareode-for-output-channel-pruning` for more details.
Usage
^^^^^
...
In addition, we also provide a dependency-aware mode for the L1FilterPruner. For more details about the dependency-aware mode, please refer to :ref:`dependency-awareode-for-output-channel-pruning`.
Usage
^^^^^
...

L2Filter Pruner
---------------
This is a structured pruning algorithm that prunes the filters with the smallest L2 norm of the weights. It is implemented as a one-shot pruner.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-awareode-for-output-channel-pruning` for more details.
ActivationMeanRankFilterPruner is a pruner that prunes the filters with the smallest importance criterion ``mean activation``, calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference <https://arxiv.org/abs/1611.06440>`__. Other pruning criteria mentioned in this paper will be supported in a future release.
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-awareode-for-output-channel-pruning` for more details.
Usage
^^^^^
...
We also provide a dependency-aware mode for this pruner to get a better speedup from the pruning. Please refer to :ref:`dependency-awareode-for-output-channel-pruning` for more details.
What's more, we provide a global-sort mode for this pruner, which is aligned with the paper's implementation. Set the parameter ``global_sort`` to ``True`` when instantiating ``TaylorFOWeightFilterPruner``, as in the sketch below.
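A hedged sketch of enabling the global-sort mode follows. The training-related arguments that this pruner needs in order to collect activation statistics (``optimizer``, ``trainer``, ``criterion``) are shown as placeholders and are assumptions about the v1-style constructor; check the installed NNI version for the exact signature.

.. code-block:: python

   import torch
   from torchvision.models import resnet18

   # Assumed v1-style import path; verify against your NNI version.
   from nni.algorithms.compression.pytorch.pruning import TaylorFOWeightFilterPruner

   model = resnet18()
   config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
   criterion = torch.nn.CrossEntropyLoss()
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)


   def trainer(model, optimizer, criterion, epoch):
       # Placeholder training function: iterate over your own data loader here
       # so the pruner can observe a few batches and estimate Taylor importance.
       pass


   pruner = TaylorFOWeightFilterPruner(
       model, config_list,
       optimizer=optimizer,
       trainer=trainer,
       criterion=criterion,
       global_sort=True,  # rank filters across all layers jointly, as in the paper
   )
   pruner.compress()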