@@ -31,7 +31,6 @@ We further elaborate on the two methods, pruning and quantization, in the follow
NNI provides an easy-to-use toolkit to help users design and use model pruning and quantization algorithms.
To compress their models, users only need to add several lines to their code.
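For example, a minimal pruning workflow looks like the sketch below (assuming the PyTorch ``L1NormPruner`` and the NNI 2.x import path; ``model`` is an existing ``torch.nn.Module``):

.. code-block:: python

    from nni.compression.pytorch.pruning import L1NormPruner

    # prune 50% of the output channels of every Conv2d layer, ranked by L1 norm
    config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

    pruner = L1NormPruner(model, config_list)
    # compress() only generates masks; it does not yet shrink the network
    _, masks = pruner.compress()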
Some popular model compression algorithms are built into NNI.
Users can further use NNI’s auto-tuning power to find the best compressed model, which is detailed in Auto Model Compression.
On the other hand, users can easily customize their new compression algorithms using NNI’s interface.
There are several core features supported by NNI model compression:
...
@@ -54,7 +53,7 @@ If users want to apply both, a sequential mode is recommended as common practise
.. note::
Note that NNI pruners and quantizers are not meant to physically compact the model; they only simulate the compression effect. In contrast, the NNI speedup tool can truly compress a model by changing the network architecture and therefore reduce latency.
To obtain a truly compact model, users should conduct :doc:`pruning speedup <../tutorials/cp_pruning_speedup>` or :doc:`quantization speedup <../tutorials/cp_quantization_speedup>`.
The interface and APIs are unified for both PyTorch and TensorFlow. Currently, only the PyTorch version is supported; the TensorFlow version will be supported in the future.
...
@@ -69,7 +68,7 @@ Pruning algorithms compress the original network by removing redundant weights o
* - Name
- Brief Introduction of Algorithm
* - :ref:`level-pruner`
- Pruning the specified ratio of weight elements based on the absolute values of the weight elements
* - :ref:`l1-norm-pruner`
- Pruning output channels with the smallest L1 norm of weights (Pruning Filters for Efficient Convnets) `Reference Paper <https://arxiv.org/abs/1608.08710>`__
* - :ref:`l2-norm-pruner`
...
@@ -140,8 +139,8 @@ The following figure shows how NNI prunes and speeds up your models.
:scale: 40%
:alt:
The detailed tutorial of Speedup Model with Mask can be found :doc:`here <../tutorials/cp_pruning_speedup>`.
The detailed tutorial of Speedup Model with Calibration Config can be found :doc:`here <../tutorials/cp_quantization_speedup>`.
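As a rough sketch of the mask-based speedup (assuming ``ModelSpeedup`` from ``nni.compression.pytorch`` and the ``masks`` produced by a pruner; exact argument names may differ between NNI versions):

.. code-block:: python

    import torch
    from nni.compression.pytorch import ModelSpeedup

    # the pruner only simulates sparsity through masks, so unwrap the model first
    pruner._unwrap_model()

    # physically remove the masked channels and shrink the dependent layers
    ModelSpeedup(model, dummy_input=torch.rand(1, 3, 224, 224), masks_file=masks).speedup_model()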
In these pruning algorithms, the pruner will prune each layer separately. While pruning a layer,
the algorithm quantifies the importance of each filter based on specific metrics (such as the L1 norm) and prunes the less important output channels.
We use the pruning of convolutional layers as an example to explain the dependency-aware mode.
As the :ref:`topology analysis utils <topology-analysis>` show, if the output channels of two convolutional layers (conv1, conv2) are added together,
then these two convolutional layers have a channel dependency with each other (for more details, please see :ref:`ChannelDependency <topology-analysis>`).
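A small sketch of such a dependency and of enabling the dependency-aware mode is given below (the toy ``Block`` module is only illustrative; ``mode`` and ``dummy_input`` follow the NNI pruner API):

.. code-block:: python

    import torch
    import torch.nn as nn
    from nni.compression.pytorch.pruning import L1NormPruner

    class Block(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
            self.conv2 = nn.Conv2d(3, 16, 3, padding=1)

        def forward(self, x):
            # conv1 and conv2 outputs are added, so their output channels must be pruned consistently
            return self.conv1(x) + self.conv2(x)

    model = Block()
    config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

    # dependency-aware mode needs a dummy input to trace the channel dependencies
    pruner = L1NormPruner(model, config_list, mode='dependency_aware',
                          dummy_input=torch.rand(1, 3, 32, 32))
    _, masks = pruner.compress()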
@@ -42,7 +42,7 @@ Using AGP Pruning as an example to explain how to implement an iterative pruning
The full script can be found :githublink:`here <examples/model_compress/pruning/v2/scheduler_torch.py>`.
In this example, we use an L1 Norm Pruner in dependency-aware mode as the basic pruner during each iteration.
Note we do not need to pass ``model`` and ``config_list`` to the pruner, because in each iteration the ``model`` and ``config_list`` used by the pruner are received from the task generator.
Then we can use ``scheduler`` as an iterative pruner directly. In fact, this is the implementation of ``AGPPruner`` in NNI.
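A rough sketch of this workflow, loosely following the referenced ``scheduler_torch.py`` example, is shown below; the import paths, constructor arguments, and the assumed ``finetuner`` callable are assumptions that may differ across NNI versions:

.. code-block:: python

    import torch
    from nni.compression.pytorch.pruning import L1NormPruner
    from nni.compression.pytorch.pruning.basic_scheduler import PruningScheduler
    from nni.compression.pytorch.pruning.tools import AGPTaskGenerator

    dummy_input = torch.rand(1, 3, 32, 32)

    # model and config_list are handed to the task generator, not to the pruner
    task_generator = AGPTaskGenerator(total_iteration=10, origin_model=model,
                                      origin_config_list=config_list)

    # the basic pruner receives model/config_list from the task generator in each iteration
    pruner = L1NormPruner(None, None, mode='dependency_aware', dummy_input=dummy_input)

    # finetuner is assumed to be a user-defined function: finetuner(model) -> None
    scheduler = PruningScheduler(pruner, task_generator, finetuner=finetuner,
                                 speedup=True, dummy_input=dummy_input)
    scheduler.compress()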
"\n# Pruning Quickstart\n\nModel pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.\nIt usually has following paths:\n\n#. Pre-training a model -> Pruning the model -> Fine-tuning the model\n#. Pruning the model aware training -> Fine-tuning the model\n#. Pruning the model -> Pre-training the compact model\n\nNNI supports the above three modes and mainly focuses on the pruning stage.\nFollow this tutorial for a quick look at how to use NNI to prune a model in a common practice.\n"
"\n# Pruning Quickstart\n\nModel pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.\nThere are three common practices for pruning a DNN model:\n\n#. Pre-training a model -> Pruning the model -> Fine-tuning the pruned model\n#. Pruning a model during training (i.e., pruning aware training) -> Fine-tuning the pruned model\n#. Pruning a model -> Training the pruned model from scratch\n\nNNI supports all of the above pruning practices by working on the key pruning stage.\nFollowing this tutorial for a quick look at how to use NNI to prune a model in a common practice.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n\nIn this tutorial, we use a simple model and pre-train on MNIST dataset.\nIf you are familiar with defining a model and training in pytorch, you can skip directly to `Pruning Model`_.\n\n"
"## Preparation\n\nIn this tutorial, we use a simple model and pre-trained on MNIST dataset.\nIf you are familiar with defining a model and training in pytorch, you can skip directly to `Pruning Model`_.\n\n"
]
},
{
...
@@ -51,7 +51,7 @@
"cell_type": "markdown",
"cell_type": "markdown",
"metadata": {},
"metadata": {},
"source": [
"source": [
"## Pruning Model\n\nUsing L1NormPruner pruning the model and generating the masks.\nUsually, pruners require original model and ``config_list`` as parameters.\nDetailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.\n\nThis `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,\nexcept the layer named `fc3`, because `fc3` is `exclude`.\nThe final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.\n\n"
"## Pruning Model\n\nUsing L1NormPruner to prune the model and generate the masks.\nUsually, a pruner requires original model and ``config_list`` as its inputs.\nDetailed about how to write ``config_list`` please refer :doc:`compression config specification <../compression/compression_config_list>`.\n\nThe following `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,\nexcept the layer named `fc3`, because `fc3` is `exclude`.\nThe final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.\n\n"
"The model has now been wrapped, and quantization targets ('quant_types' setting in `config_list`)\nwill be quantized & dequantized for simulated quantization in the wrapped layers.\nQAT is a training-aware quantizer, it will update scale and zero point during training.\n\n"
]
},
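{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference (not executed here), the wrapping step typically looks like the sketch below; the ``QAT_Quantizer`` import path and the exact ``config_list`` keys are assumptions based on the NNI quantization API and may differ between NNI versions. ``model`` and ``optimizer`` are the objects defined earlier in this tutorial.\n\n```python\nfrom nni.algorithms.compression.pytorch.quantization import QAT_Quantizer\n\n# assumed config: quantize weights and outputs of Conv2d/Linear layers to 8 bit\nconfig_list = [{\n    'quant_types': ['weight', 'output'],\n    'quant_bits': {'weight': 8, 'output': 8},\n    'op_types': ['Conv2d', 'Linear']\n}]\n\n# QAT needs the training optimizer so it can update scale and zero point during training\nquantizer = QAT_Quantizer(model, config_list, optimizer)\nquantizer.compress()\n```\n"
]
},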
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for epoch in range(3):\n trainer(model, optimizer, criterion)\n evaluator(model)"