Overview of NNI Model Compression
=================================
Deep neural networks (DNNs) have achieved great success in many tasks like computer vision, natural language processing, and speech processing.
However, typical neural networks are both computationally expensive and energy-intensive,
...
...
There are several core features supported by NNI model compression:
* Concise interface for users to customize their own compression algorithms.
Compression Pipeline
--------------------
.. image:: ../../img/compression_pipeline.png
:target: ../../img/compression_pipeline.png
...
...
If users want to apply both, a sequential mode is recommended as a common practice: pruning first, then quantization on the pruned model.
The interface and APIs are unified for both PyTorch and TensorFlow. Currently only the PyTorch version is supported; the TensorFlow version will be supported in the future.
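
As a sketch of that sequential practice (assuming a PyTorch ``model`` and ``device`` are already defined, and using ``L1NormPruner`` with a 50% ``Conv2d`` sparsity purely as placeholders), the flow looks roughly like this:

.. code-block:: python

    import torch
    from nni.compression.pytorch.pruning import L1NormPruner
    from nni.compression.pytorch.speedup import ModelSpeedup

    # 1. prune: generate masks that zero out 50% of the filters in every Conv2d layer
    config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]
    pruner = L1NormPruner(model, config_list)
    _, masks = pruner.compress()

    # 2. speedup: unwrap the model and physically remove the masked weights
    pruner._unwrap_model()
    ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()

    # 3. quantize: apply a quantization algorithm (e.g. 8-bit QAT) to the pruned
    #    model and fine-tune it; see the quantization part below for a sketch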
Model Speedup
-------------
The final goal of model compression is to reduce inference latency and model size.
However, existing model compression algorithms mainly use simulation to check the performance (e.g., accuracy) of the compressed model.
Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and mitigate the over-fitting issue.
NNI implements the main logic of each pruning algorithm as a pruner. All pruners are implemented as closely as possible to what is described in the corresponding paper (when one exists).
The following table provides a brief introduction to the pruners implemented in NNI; click a link in the table to view a more detailed introduction and use cases.
There are two kinds of pruners in NNI; please refer to `basic pruner <basic-pruner>`_ and `scheduled pruner <scheduled-pruner>`_ for details.
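
For example, a basic pruner takes the model and a config list describing where and how much to prune, and returns the masked model together with the masks (a minimal sketch; ``model`` and the layer name ``fc`` are placeholders):

.. code-block:: python

    from nni.compression.pytorch.pruning import L2NormPruner

    # prune 50% of the filters in every Conv2d layer, but leave the classifier alone
    config_list = [
        {'sparsity': 0.5, 'op_types': ['Conv2d']},
        {'exclude': True, 'op_names': ['fc']},
    ]

    pruner = L2NormPruner(model, config_list)
    masked_model, masks = pruner.compress()

    # the masks simulate pruning; print the remaining weight ratio of each layer
    for name, mask in masks.items():
        print(name, float(mask['weight'].sum() / mask['weight'].numel()))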
In addition, for the convolutional layers that have more than one filter group, the dependency-aware pruner will also try to prune the same number of channels for each filter group.
Overall, this pruner prunes the model according to the L1 norm of each filter and tries to meet the topological constraints (channel dependency, etc.) to improve the final speed gain after the speedup process.
In dependency-aware mode, the pruner can therefore achieve a better speed gain from model pruning.
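
A hedged sketch of enabling this mode, assuming the NNI 2.x pruner interface where dependency awareness is selected via a ``mode`` argument and a ``dummy_input`` is required to trace channel dependencies (please check the pruner reference for the exact signature; ``model``, ``device``, and the input shape are placeholders):

.. code-block:: python

    import torch
    from nni.compression.pytorch.pruning import L1NormPruner

    config_list = [{'sparsity': 0.5, 'op_types': ['Conv2d']}]

    # dummy_input lets the pruner trace the graph and group layers with channel dependencies
    pruner = L1NormPruner(model, config_list, mode='dependency_aware',
                          dummy_input=torch.rand(1, 3, 224, 224).to(device))
    _, masks = pruner.compress()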
Quantization algorithms compress the original network by reducing the number of bits required to represent weights or activations, which can reduce computation and inference time.
NNI implements the main logic of each quantization algorithm as a quantizer. All quantizers are implemented as closely as possible to what is described in the corresponding paper (when one exists).
The following table provides a brief introduction to the quantizers implemented in NNI; click a link in the table to view a more detailed introduction and use cases.
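
For instance, a quantizer is constructed from the model, a config list saying which tensors to quantize and to how many bits, and (for training-aware algorithms such as QAT) the training optimizer. The import path and argument order below follow the NNI 2.x interface to the best of my knowledge and should be double-checked against the quantizer reference; ``model`` and ``optimizer`` are assumed to exist:

.. code-block:: python

    from nni.compression.pytorch.quantization import QAT_Quantizer

    # quantize the weights and outputs of all Conv2d layers to 8 bits (illustrative config)
    config_list = [{
        'quant_types': ['weight', 'output'],
        'quant_bits': {'weight': 8, 'output': 8},
        'op_types': ['Conv2d'],
    }]

    quantizer = QAT_Quantizer(model, config_list, optimizer)
    quantizer.compress()
    # afterwards, fine-tune the model as usual so the simulated quantization is trained in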
NNI provides an easy-to-use model compression framework to compress deep neural networks; the compressed networks typically have a much smaller model size and faster
inference speed without losing performance significantly. Model compression on NNI includes pruning algorithms and quantization algorithms. NNI provides many pruning and
quantization algorithms through the NNI trial SDK. Users can directly use them in their trial code and run the trial code without starting an NNI experiment. Users can also use the NNI model compression framework to customize their own pruning and quantization algorithms.
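
As a rough illustration of such customization (using the ``Pruner`` base class and ``calc_mask`` hook from earlier NNI releases; the exact base class and hook names have changed across NNI versions, so treat them as assumptions rather than the current API):

.. code-block:: python

    import torch
    from nni.compression.pytorch import Pruner


    class RandomPruner(Pruner):
        """Toy pruner that randomly keeps weights at the sparsity given in the config."""

        def calc_mask(self, wrapper, **kwargs):
            weight = wrapper.module.weight.data
            sparsity = wrapper.config['sparsity']
            # keep a random (1 - sparsity) fraction of the weights
            mask = (torch.rand_like(weight) >= sparsity).type_as(weight)
            return {'weight_mask': mask}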
A detailed description of model compression and its usage can be found :doc:`here <../compression/overview>`.
"# need to unwrap the model, if the model is wrapped before speedup\npruner._unwrap_model()\n\n# speedup the model\nfrom nni.compression.pytorch.speedup import ModelSpeedup\n\nModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()"
"# need to unwrap the model, if the model is wrapped before speedup\npruner._unwrap_model()\n\n# speedup the model, for more information about speedup, please refer :doc:`pruning_speedup`.\nfrom nni.compression.pytorch.speedup import ModelSpeedup\n\nModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()"
.. code-block:: none

    /home/nishang/anaconda3/envs/MCM/lib/python3.9/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811803361/work/build/aten/src/ATen/core/TensorBody.h:417.)
      return self._grad
...
...
The model will become really smaller after speedup.
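
To verify that on your own model, plain PyTorch is enough; the sketch below assumes ``model`` is the speeded-up network, ``device`` is defined, and reuses the ``(3, 1, 28, 28)`` dummy input from the example above:

.. code-block:: python

    import time
    import torch

    # model size: number of parameters left after speedup
    num_params = sum(p.numel() for p in model.parameters())
    print('parameters:', num_params)

    # rough latency measurement on the dummy input
    dummy_input = torch.rand(3, 1, 28, 28).to(device)
    model.eval()
    with torch.no_grad():
        start = time.time()
        for _ in range(100):
            model(dummy_input)
        print('average latency (s):', (time.time() - start) / 100)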