Unverified Commit db3130d7 authored by J-shang, committed by GitHub

[Doc] Compression (#4574)

parent cef9babd
...@@ -10,6 +10,10 @@ Tutorials
tutorials/nni_experiment
tutorials/hello_nas
tutorials/nasbench_as_dataset
tutorials/pruning_quick_start_mnist
tutorials/pruning_speed_up
tutorials/quantization_quick_start_mnist
tutorials/quantization_speed_up
.. ----------------------
...@@ -35,3 +39,35 @@ Tutorials
:image: ../img/thumbnails/overview-30.png
:background: pink
:tags: NAS
.. cardlinkitem::
:header: Get Started with Model Pruning on MNIST
:description: Familiarize yourself with pruning to compress your model
:link: tutorials/pruning_quick_start_mnist.html
:image: ../img/thumbnails/overview-29.png
:background: cyan
:tags: Compression
.. cardlinkitem::
:header: Get Started with Model Quantization on MNIST
:description: Familiarize yourself with quantization to compress your model
:link: tutorials/quantization_quick_start_mnist.html
:image: ../img/thumbnails/overview-29.png
:background: cyan
:tags: Compression
.. cardlinkitem::
:header: Speed Up Model with Mask
:description: Make your model actually smaller and faster by applying speedup after it is pruned by a pruner
:link: tutorials/pruning_speed_up.html
:image: ../img/thumbnails/overview-29.png
:background: cyan
:tags: Compression
.. cardlinkitem::
:header: Speed Up Model with Calibration Config
:description: Make your model actually smaller and faster by applying speedup after it is quantized by a quantizer
:link: tutorials/quantization_speed_up.html
:image: ../img/thumbnails/overview-29.png
:background: cyan
:tags: Compression
...@@ -9,6 +9,27 @@ Tutorials
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Introduction ------------">
.. only:: html
.. figure:: /tutorials/images/thumb/sphx_glr_pruning_speed_up_thumb.png
:alt: Speed Up Model with Mask
:ref:`sphx_glr_tutorials_pruning_speed_up.py`
.. raw:: html
</div>
.. toctree::
:hidden:
/tutorials/pruning_speed_up
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Start and Manage a New Experiment">
...@@ -30,6 +51,69 @@ Tutorials
/tutorials/nni_experiment
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Model pruning is a technique to reduce the model size and computation by reducing model weight ...">
.. only:: html
.. figure:: /tutorials/images/thumb/sphx_glr_pruning_quick_start_mnist_thumb.png
:alt: Pruning Quickstart
:ref:`sphx_glr_tutorials_pruning_quick_start_mnist.py`
.. raw:: html
</div>
.. toctree::
:hidden:
/tutorials/pruning_quick_start_mnist
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="Quantization reduces model size and speeds up inference time by reducing the number of bits req...">
.. only:: html
.. figure:: /tutorials/images/thumb/sphx_glr_quantization_quick_start_mnist_thumb.png
:alt: Quantization Quickstart
:ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py`
.. raw:: html
</div>
.. toctree::
:hidden:
/tutorials/quantization_quick_start_mnist
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip=" Introduction ------------">
.. only:: html
.. figure:: /tutorials/images/thumb/sphx_glr_quantization_speed_up_thumb.png
:alt: Speed Up Model with Calibration Config
:ref:`sphx_glr_tutorials_quantization_speed_up.py`
.. raw:: html
</div>
.. toctree::
:hidden:
/tutorials/quantization_speed_up
.. raw:: html
<div class="sphx-glr-thumbcontainer" tooltip="In this tutorial, we show how to use NAS Benchmarks as datasets. For research purposes we somet...">
......
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Pruning Quickstart\n\nModel pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.\nIt usually has following paths:\n\n#. Pre-training a model -> Pruning the model -> Fine-tuning the model\n#. Pruning the model aware training -> Fine-tuning the model\n#. Pruning the model -> Pre-training the compact model\n\nNNI supports the above three modes and mainly focuses on the pruning stage.\nFollow this tutorial for a quick look at how to use NNI to prune a model in a common practice.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n\nIn this tutorial, we use a simple model and pre-train on MNIST dataset.\nIf you are familiar with defining a model and training in pytorch, you can skip directly to `Pruning Model`_.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nimport torch.nn.functional as F\nfrom torch.optim import SGD\n\nfrom scripts.compression_mnist_model import TorchModel, trainer, evaluator, device\n\n# define the model\nmodel = TorchModel().to(device)\n\n# define the optimizer and criterion for pre-training\n\noptimizer = SGD(model.parameters(), 1e-2)\ncriterion = F.nll_loss\n\n# pre-train and evaluate the model on MNIST dataset\nfor epoch in range(3):\n trainer(model, optimizer, criterion)\n evaluator(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pruning Model\n\nUsing L1NormPruner pruning the model and generating the masks.\nUsually, pruners require original model and ``config_list`` as parameters.\nDetailed about how to write ``config_list`` please refer ...\n\nThis `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned,\nexcept the layer named `fc3`, because `fc3` is `exclude`.\nThe final sparsity ratio for each layer is 50%. The layer named `fc3` will not be pruned.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"config_list = [{\n 'sparsity_per_layer': 0.5,\n 'op_types': ['Linear', 'Conv2d']\n}, {\n 'exclude': True,\n 'op_names': ['fc3']\n}]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pruners usually require `model` and `config_list` as input arguments.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner\n\npruner = L1NormPruner(model, config_list)\n# show the wrapped model structure\nprint(model)\n# compress the model and generate the masks\n_, masks = pruner.compress()\n# show the masks sparsity\nfor name, mask in masks.items():\n print(name, ' sparsity: ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Speed up the original model with masks, note that `ModelSpeedup` requires an unwrapped model.\nThe model becomes smaller after speed-up,\nand reaches a higher sparsity ratio because `ModelSpeedup` will propagate the masks across layers.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# need to unwrap the model, if the model is wrapped before speed up\npruner._unwrap_model()\n\n# speed up the model\nfrom nni.compression.pytorch.speedup import ModelSpeedup\n\nModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"the model will become real smaller after speed up\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fine-tuning Compacted Model\nNote that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning.\nBecause speed up will replace the masked big layers with dense small ones.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"optimizer = SGD(model.parameters(), 1e-2)\nfor epoch in range(3):\n trainer(model, optimizer, criterion)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
\ No newline at end of file
"""
Pruning Quickstart
==================
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
It usually follows one of these paths:
#. Pre-training a model -> Pruning the model -> Fine-tuning the model
#. Training the model with pruning-aware training -> Fine-tuning the model
#. Pruning the model -> Pre-training the compact model
NNI supports all three of these modes and mainly focuses on the pruning stage.
Follow this tutorial for a quick look at how to use NNI to prune a model in common practice.
"""
# %%
# Preparation
# -----------
#
# In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
# If you are familiar with defining a model and training it in PyTorch, you can skip directly to `Pruning Model`_.
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
# %%
# Pruning Model
# -------------
#
# Use L1NormPruner to prune the model and generate the masks.
# Usually, pruners require the original model and a ``config_list`` as parameters.
# For details on how to write a ``config_list``, please refer ...
#
# This `config_list` means that all layers whose type is `Linear` or `Conv2d` will be pruned,
# except the layer named `fc3`, because `fc3` is marked `exclude`.
# The final sparsity ratio of each pruned layer is 50%.
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
}]
# %%
# Pruners usually require `model` and `config_list` as input arguments.
from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure
print(model)
# compress the model and generate the masks
_, masks = pruner.compress()
# show the masks sparsity
for name, mask in masks.items():
print(name, ' sparsity: ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))
# %%
# Speed up the original model with the masks; note that `ModelSpeedup` requires an unwrapped model.
# The model becomes smaller after speedup,
# and it may reach a higher sparsity ratio because `ModelSpeedup` propagates the masks across layers.
# the model needs to be unwrapped if it was wrapped before speedup
pruner._unwrap_model()
# speed up the model
from nni.compression.pytorch.speedup import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
# %%
# The model is now actually smaller after speedup.
print(model)
# %%
# Fine-tuning the Compacted Model
# -------------------------------
# Note that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning,
# because speedup replaces the masked large layers with dense small ones.
optimizer = SGD(model.parameters(), 1e-2)
for epoch in range(3):
trainer(model, optimizer, criterion)
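# %%
# Finally, evaluate the fine-tuned compact model (an extra step added here for
# illustration; ``evaluator`` is the same helper imported from
# ``scripts.compression_mnist_model`` above).
evaluator(model)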
775624e7a28ae5c6eb2027eace7fff67
\ No newline at end of file
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/pruning_quick_start_mnist.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_pruning_quick_start_mnist.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_pruning_quick_start_mnist.py:
Pruning Quickstart
==================
Model pruning is a technique to reduce the model size and computation by reducing model weight size or intermediate state size.
It usually follows one of these paths:
#. Pre-training a model -> Pruning the model -> Fine-tuning the model
#. Training the model with pruning-aware training -> Fine-tuning the model
#. Pruning the model -> Pre-training the compact model
NNI supports all three of these modes and mainly focuses on the pruning stage.
Follow this tutorial for a quick look at how to use NNI to prune a model in common practice.
.. GENERATED FROM PYTHON SOURCE LINES 17-22
Preparation
-----------
In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
If you are familiar with defining a model and training it in PyTorch, you can skip directly to `Pruning Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 22-42
.. code-block:: default
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 1.8381, Accuracy: 5939/10000 (59%)
Average test loss: 0.3143, Accuracy: 9045/10000 (90%)
Average test loss: 0.1928, Accuracy: 9387/10000 (94%)
.. GENERATED FROM PYTHON SOURCE LINES 43-53
Pruning Model
-------------
Use L1NormPruner to prune the model and generate the masks.
Usually, pruners require the original model and a ``config_list`` as parameters.
For details on how to write a ``config_list``, please refer ...
This `config_list` means that all layers whose type is `Linear` or `Conv2d` will be pruned,
except the layer named `fc3`, because `fc3` is marked `exclude`.
The final sparsity ratio of each pruned layer is 50%.
.. GENERATED FROM PYTHON SOURCE LINES 53-62
.. code-block:: default
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
}]
.. GENERATED FROM PYTHON SOURCE LINES 63-64
Pruners usually require `model` and `config_list` as input arguments.
.. GENERATED FROM PYTHON SOURCE LINES 64-76
.. code-block:: default
from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure
print(model)
# compress the model and generate the masks
_, masks = pruner.compress()
# show the masks sparsity
for name, mask in masks.items():
print(name, ' sparsity: ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): PrunerModuleWrapper(
(module): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
)
(conv2): PrunerModuleWrapper(
(module): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
)
(fc1): PrunerModuleWrapper(
(module): Linear(in_features=256, out_features=120, bias=True)
)
(fc2): PrunerModuleWrapper(
(module): Linear(in_features=120, out_features=84, bias=True)
)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
conv1 sparsity: 0.5
conv2 sparsity: 0.5
fc1 sparsity: 0.5
fc2 sparsity: 0.5
.. GENERATED FROM PYTHON SOURCE LINES 77-80
Speed up the original model with the masks; note that `ModelSpeedup` requires an unwrapped model.
The model becomes smaller after speedup,
and it may reach a higher sparsity ratio because `ModelSpeedup` propagates the masks across layers.
.. GENERATED FROM PYTHON SOURCE LINES 80-89
.. code-block:: default
# the model needs to be unwrapped if it was wrapped before speedup
pruner._unwrap_model()
# speed up the model
from nni.compression.pytorch.speedup import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
/home/ningshang/nni/nni/compression/pytorch/utils/mask_conflict.py:124: UserWarning: This overload of nonzero is deprecated:
nonzero()
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
all_ones = (w_mask.flatten(1).sum(-1) == count).nonzero().squeeze(1).tolist()
/home/ningshang/nni/nni/compression/pytorch/speedup/infer_mask.py:262: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.
if isinstance(self.output, torch.Tensor) and self.output.grad is not None:
/home/ningshang/nni/nni/compression/pytorch/speedup/compressor.py:282: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.
if last_output.grad is not None and tin.grad is not None:
.. GENERATED FROM PYTHON SOURCE LINES 90-91
The model is now actually smaller after speedup.
.. GENERATED FROM PYTHON SOURCE LINES 91-93
.. code-block:: default
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 3, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(3, 8, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=128, out_features=60, bias=True)
(fc2): Linear(in_features=60, out_features=42, bias=True)
(fc3): Linear(in_features=42, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 94-98
Fine-tuning the Compacted Model
-------------------------------
Note that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning,
because speedup replaces the masked large layers with dense small ones.
.. GENERATED FROM PYTHON SOURCE LINES 98-102
.. code-block:: default
optimizer = SGD(model.parameters(), 1e-2)
for epoch in range(3):
trainer(model, optimizer, criterion)
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 1 minutes 15.845 seconds)
.. _sphx_glr_download_tutorials_pruning_quick_start_mnist.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: pruning_quick_start_mnist.py <pruning_quick_start_mnist.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: pruning_quick_start_mnist.ipynb <pruning_quick_start_mnist.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Speed Up Model with Mask\n\n## Introduction\n\nPruning algorithms usually use weight masks to simulate the real pruning. Masks can be used\nto check model performance of a specific pruning (or sparsity), but there is no real speedup.\nSince model speedup is the ultimate goal of model pruning, we try to provide a tool to users\nto convert a model to a smaller one based on user provided masks (the masks come from the\npruning algorithms).\n\nThere are two types of pruning. One is fine-grained pruning, it does not change the shape of weights,\nand input/output tensors. Sparse kernel is required to speed up a fine-grained pruned layer.\nThe other is coarse-grained pruning (e.g., channels), shape of weights and input/output tensors usually change due to such pruning.\nTo speed up this kind of pruning, there is no need to use sparse kernel, just replace the pruned layer with smaller one.\nSince the support of sparse kernels in community is limited,\nwe only support the speedup of coarse-grained pruning and leave the support of fine-grained pruning in future.\n\n## Design and Implementation\n\nTo speed up a model, the pruned layers should be replaced, either replaced with smaller layer for coarse-grained mask,\nor replaced with sparse kernel for fine-grained mask. Coarse-grained mask usually changes the shape of weights or input/output tensors,\nthus, we should do shape inference to check are there other unpruned layers should be replaced as well due to shape change.\nTherefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;\nsecond, replace the modules.\n\nThe first step requires topology (i.e., connections) of the model, we use ``jit.trace`` to obtain the model graph for PyTorch.\nThe new shape of module is auto-inference by NNI, the unchanged parts of outputs during forward and inputs during backward are prepared for reduct.\nFor each type of module, we should prepare a function for module replacement.\nThe module replacement function returns a newly created module which is smaller.\n\n## Usage\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a mask for the model at first.\nWe usually use a NNI pruner to generate the masks then use ``ModelSpeedup`` to compact the model.\nBut in fact ``ModelSpeedup`` is a relatively independent tool, so you can use it independently.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nfrom scripts.compression_mnist_model import TorchModel, device\n\nmodel = TorchModel().to(device)\n# masks = {layer_name: {'weight': weight_mask, 'bias': bias_mask}}\nconv1_mask = torch.ones_like(model.conv1.weight.data)\n# mask the first three output channels in conv1\nconv1_mask[0: 3] = 0\nmasks = {'conv1': {'weight': conv1_mask}}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the original model structure.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Roughly test the original model inference speed.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import time\nstart = time.time()\nmodel(torch.rand(128, 1, 28, 28).to(device))\nprint('Original Model - Elapsed Time : ', time.time() - start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Speed up the model and show the model structure after speed up.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.compression.pytorch import ModelSpeedup\nModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()\nprint(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Roughly test the model after speed-up inference speed.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"start = time.time()\nmodel(torch.rand(128, 1, 28, 28).to(device))\nprint('Speedup Model - Elapsed Time : ', time.time() - start)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For combining usage of ``Pruner`` masks generation with ``ModelSpeedup``,\nplease refer to `Pruning Quick Start <./pruning_quick_start_mnist.html>`__.\n\nNOTE: The current implementation supports PyTorch 1.3.1 or newer.\n\n## Limitations\n\nFor PyTorch we can only replace modules, if functions in ``forward`` should be replaced,\nour current implementation does not work. One workaround is make the function a PyTorch module.\n\nIf you want to speed up your own model which cannot supported by the current implementation,\nyou need implement the replace function for module replacement, welcome to contribute.\n\n## Speedup Results of Examples\n\nThe code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.\n\nThese result are tested on the `legacy pruning framework <../comporession/pruning_legacy>`__, new results will coming soon.\n\n### slim pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - Times\n - Mask Latency\n - Speedup Latency\n * - 1\n - 0.01197\n - 0.005107\n * - 2\n - 0.02019\n - 0.008769\n * - 4\n - 0.02733\n - 0.014809\n * - 8\n - 0.04310\n - 0.027441\n * - 16\n - 0.07731\n - 0.05008\n * - 32\n - 0.14464\n - 0.10027\n\n### fpgm pruner example\n\non cpu,\ninput tensor: ``torch.randn(64, 1, 28, 28)``\\ ,\ntoo large variance\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - Times\n - Mask Latency\n - Speedup Latency\n * - 1\n - 0.01383\n - 0.01839\n * - 2\n - 0.01167\n - 0.003558\n * - 4\n - 0.01636\n - 0.01088\n * - 40\n - 0.14412\n - 0.08268\n * - 40\n - 1.29385\n - 0.14408\n * - 40\n - 0.41035\n - 0.46162\n * - 400\n - 6.29020\n - 5.82143\n\n### l1filter pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - Times\n - Mask Latency\n - Speedup Latency\n * - 1\n - 0.01026\n - 0.003677\n * - 2\n - 0.01657\n - 0.008161\n * - 4\n - 0.02458\n - 0.020018\n * - 8\n - 0.03498\n - 0.025504\n * - 16\n - 0.06757\n - 0.047523\n * - 32\n - 0.10487\n - 0.086442\n\n### APoZ pruner example\n\non one V100 GPU,\ninput tensor: ``torch.randn(64, 3, 32, 32)``\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - Times\n - Mask Latency\n - Speedup Latency\n * - 1\n - 0.01389\n - 0.004208\n * - 2\n - 0.01628\n - 0.008310\n * - 4\n - 0.02521\n - 0.014008\n * - 8\n - 0.03386\n - 0.023923\n * - 16\n - 0.06042\n - 0.046183\n * - 32\n - 0.12421\n - 0.087113\n\n### SimulatedAnnealing pruner example\n\nIn this experiment, we use SimulatedAnnealing pruner to prune the resnet18 on the cifar10 dataset.\nWe measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.\nThe latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.\n\n<img src=\"file://../../img/SA_latency_accuracy.png\">\n\n### User configuration for ModelSpeedup\n\n**PyTorch**\n\n.. autoclass:: nni.compression.pytorch.ModelSpeedup\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
\ No newline at end of file
"""
Speed Up Model with Mask
========================
Introduction
------------
Pruning algorithms usually use weight masks to simulate the real pruning. Masks can be used
to check the model performance under a specific pruning (or sparsity) setting, but there is no real speedup.
Since model speedup is the ultimate goal of model pruning, we provide a tool that converts a model
to a smaller one based on user-provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. One is fine-grained pruning, which does not change the shape of weights
or input/output tensors. A sparse kernel is required to speed up a fine-grained pruned layer.
The other is coarse-grained pruning (e.g., channels); the shapes of weights and input/output tensors usually change due to such pruning.
To speed up this kind of pruning, there is no need for a sparse kernel; the pruned layer is simply replaced with a smaller one.
Since community support for sparse kernels is limited,
we only support the speedup of coarse-grained pruning and leave fine-grained pruning for the future.
Design and Implementation
-------------------------
To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask
or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors;
thus, we should do shape inference to check whether other unpruned layers should be replaced as well due to the shape change.
Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
second, replace the modules.
The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
The new shape of each module is automatically inferred by NNI; the unchanged parts of the outputs during the forward pass and of the inputs during the backward pass are recorded for reduction.
For each type of module, we prepare a function for module replacement.
The module replacement function returns a newly created module which is smaller (see the sketch right after this introduction).
Usage
-----
"""
# %%
# First, generate a mask for the model.
# We usually use an NNI pruner to generate the masks and then use ``ModelSpeedup`` to compact the model.
# In fact, ``ModelSpeedup`` is a relatively independent tool, so you can also use it on its own.
import torch
from scripts.compression_mnist_model import TorchModel, device
model = TorchModel().to(device)
# masks = {layer_name: {'weight': weight_mask, 'bias': bias_mask}}
conv1_mask = torch.ones_like(model.conv1.weight.data)
# mask the first three output channels in conv1
conv1_mask[0: 3] = 0
masks = {'conv1': {'weight': conv1_mask}}
# %%
# Show the original model structure.
print(model)
# %%
# Roughly test the original model inference speed.
import time
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Original Model - Elapsed Time : ', time.time() - start)
# %%
# Speed up the model and show the model structure after speed up.
from nni.compression.pytorch import ModelSpeedup
ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
print(model)
# %%
# Roughly test the inference speed of the model after speedup.
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Speedup Model - Elapsed Time : ', time.time() - start)
# %%
# For the combined usage of ``Pruner`` mask generation and ``ModelSpeedup``,
# please refer to `Pruning Quick Start <./pruning_quick_start_mnist.html>`__.
#
# NOTE: The current implementation supports PyTorch 1.3.1 or newer.
#
# Limitations
# -----------
#
# For PyTorch, we can only replace modules; if functions in ``forward`` need to be replaced,
# our current implementation does not work. One workaround is to make the function a PyTorch module, as sketched after this section.
#
# If you want to speed up your own model that is not supported by the current implementation,
# you need to implement the replacement function for module replacement; contributions are welcome.
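#
# As a minimal sketch of that workaround (a hypothetical example, not part of
# the tutorial's helper script): turn the functional call into an ``nn.Module``
# attribute, so the operation becomes a replaceable module.

import torch.nn as nn

class BlockWithModuleRelu(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 6, 5)
        # an nn.ReLU submodule can be replaced; a bare F.relu call in forward cannot
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

# %%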
#
# Speedup Results of Examples
# ---------------------------
#
# The code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.
#
# These results are tested on the `legacy pruning framework <../comporession/pruning_legacy>`__; new results are coming soon.
#
# slim pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01197
# - 0.005107
# * - 2
# - 0.02019
# - 0.008769
# * - 4
# - 0.02733
# - 0.014809
# * - 8
# - 0.04310
# - 0.027441
# * - 16
# - 0.07731
# - 0.05008
# * - 32
# - 0.14464
# - 0.10027
#
# fpgm pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on CPU,
# input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
# the variance is too large
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01383
# - 0.01839
# * - 2
# - 0.01167
# - 0.003558
# * - 4
# - 0.01636
# - 0.01088
# * - 40
# - 0.14412
# - 0.08268
# * - 40
# - 1.29385
# - 0.14408
# * - 40
# - 0.41035
# - 0.46162
# * - 400
# - 6.29020
# - 5.82143
#
# l1filter pruner example
# ^^^^^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01026
# - 0.003677
# * - 2
# - 0.01657
# - 0.008161
# * - 4
# - 0.02458
# - 0.020018
# * - 8
# - 0.03498
# - 0.025504
# * - 16
# - 0.06757
# - 0.047523
# * - 32
# - 0.10487
# - 0.086442
#
# APoZ pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01389
# - 0.004208
# * - 2
# - 0.01628
# - 0.008310
# * - 4
# - 0.02521
# - 0.014008
# * - 8
# - 0.03386
# - 0.023923
# * - 16
# - 0.06042
# - 0.046183
# * - 32
# - 0.12421
# - 0.087113
#
# SimulatedAnnealing pruner example
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# In this experiment, we use the SimulatedAnnealing pruner to prune ResNet-18 on the CIFAR-10 dataset.
# We measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.
# The latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.
#
# .. image:: ../../img/SA_latency_accuracy.png
#
# User configuration for ModelSpeedup
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# **PyTorch**
#
# .. autoclass:: nni.compression.pytorch.ModelSpeedup
5bcdee7241d8daf931bd76f435167a58
\ No newline at end of file
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/pruning_speed_up.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_pruning_speed_up.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_pruning_speed_up.py:
Speed Up Model with Mask
========================
Introduction
------------
Pruning algorithms usually use weight masks to simulate the real pruning. Masks can be used
to check the model performance under a specific pruning (or sparsity) setting, but there is no real speedup.
Since model speedup is the ultimate goal of model pruning, we provide a tool that converts a model
to a smaller one based on user-provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. One is fine-grained pruning, which does not change the shape of weights
or input/output tensors. A sparse kernel is required to speed up a fine-grained pruned layer.
The other is coarse-grained pruning (e.g., channels); the shapes of weights and input/output tensors usually change due to such pruning.
To speed up this kind of pruning, there is no need for a sparse kernel; the pruned layer is simply replaced with a smaller one.
Since community support for sparse kernels is limited,
we only support the speedup of coarse-grained pruning and leave fine-grained pruning for the future.
Design and Implementation
-------------------------
To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask
or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors;
thus, we should do shape inference to check whether other unpruned layers should be replaced as well due to the shape change.
Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
second, replace the modules.
The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
The new shape of each module is automatically inferred by NNI; the unchanged parts of the outputs during the forward pass and of the inputs during the backward pass are recorded for reduction.
For each type of module, we prepare a function for module replacement.
The module replacement function returns a newly created module which is smaller.
Usage
-----
.. GENERATED FROM PYTHON SOURCE LINES 41-44
First, generate a mask for the model.
We usually use an NNI pruner to generate the masks and then use ``ModelSpeedup`` to compact the model.
In fact, ``ModelSpeedup`` is a relatively independent tool, so you can also use it on its own.
.. GENERATED FROM PYTHON SOURCE LINES 44-55
.. code-block:: default
import torch
from scripts.compression_mnist_model import TorchModel, device
model = TorchModel().to(device)
# masks = {layer_name: {'weight': weight_mask, 'bias': bias_mask}}
conv1_mask = torch.ones_like(model.conv1.weight.data)
# mask the first three output channels in conv1
conv1_mask[0: 3] = 0
masks = {'conv1': {'weight': conv1_mask}}
.. GENERATED FROM PYTHON SOURCE LINES 56-57
Show the original model structure.
.. GENERATED FROM PYTHON SOURCE LINES 57-59
.. code-block:: default
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
TorchModel(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=256, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 60-61
Roughly test the original model inference speed.
.. GENERATED FROM PYTHON SOURCE LINES 61-66
.. code-block:: default
import time
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Original Model - Elapsed Time : ', time.time() - start)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Original Model - Elapsed Time : 0.035959720611572266
.. GENERATED FROM PYTHON SOURCE LINES 67-68
Speed up the model and show the model structure after speed up.
.. GENERATED FROM PYTHON SOURCE LINES 68-72
.. code-block:: default
from nni.compression.pytorch import ModelSpeedup
ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
print(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) start to speed up the model
[2022-02-28 13:29:56] INFO (FixMaskConflict/MainThread) {'conv1': 1, 'conv2': 1}
[2022-02-28 13:29:56] INFO (FixMaskConflict/MainThread) dim0 sparsity: 0.500000
[2022-02-28 13:29:56] INFO (FixMaskConflict/MainThread) dim1 sparsity: 0.000000
[2022-02-28 13:29:56] INFO (FixMaskConflict/MainThread) Dectected conv prune dim" 0
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) infer module masks...
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for conv1
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for .aten::relu.5
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for .aten::max_pool2d.6
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for conv2
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for .aten::relu.7
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for .aten::max_pool2d.8
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for .aten::flatten.9
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for fc1
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for .aten::relu.10
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for fc2
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for .aten::relu.11
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for fc3
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update mask for .aten::log_softmax.12
[2022-02-28 13:29:56] ERROR (nni.compression.pytorch.speedup.jit_translate/MainThread) aten::log_softmax is not Supported! Please report an issue at https://github.com/microsoft/nni. Thanks~
[2022-02-28 13:29:56] WARNING (nni.compression.pytorch.speedup.compressor/MainThread) Note: .aten::log_softmax.12 does not have corresponding mask inference object
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the fc3
/home/ningshang/anaconda3/envs/nni-dev/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:417.)
return self._grad
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the .aten::relu.11
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the fc2
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the .aten::relu.10
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the fc1
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the .aten::flatten.9
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the .aten::max_pool2d.8
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the .aten::relu.7
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the conv2
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the .aten::max_pool2d.6
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the .aten::relu.5
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Update the indirect sparsity for the conv1
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) resolve the mask conflict
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) replace compressed modules...
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) replace module (name: conv1, op_type: Conv2d)
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Warning: cannot replace (name: .aten::relu.5, op_type: aten::relu) which is func type
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Warning: cannot replace (name: .aten::max_pool2d.6, op_type: aten::max_pool2d) which is func type
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) replace module (name: conv2, op_type: Conv2d)
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Warning: cannot replace (name: .aten::relu.7, op_type: aten::relu) which is func type
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Warning: cannot replace (name: .aten::max_pool2d.8, op_type: aten::max_pool2d) which is func type
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Warning: cannot replace (name: .aten::flatten.9, op_type: aten::flatten) which is func type
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) replace module (name: fc1, op_type: Linear)
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compress_modules/MainThread) replace linear with new in_features: 256, out_features: 120
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Warning: cannot replace (name: .aten::relu.10, op_type: aten::relu) which is func type
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) replace module (name: fc2, op_type: Linear)
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compress_modules/MainThread) replace linear with new in_features: 120, out_features: 84
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Warning: cannot replace (name: .aten::relu.11, op_type: aten::relu) which is func type
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) replace module (name: fc3, op_type: Linear)
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compress_modules/MainThread) replace linear with new in_features: 84, out_features: 10
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) Warning: cannot replace (name: .aten::log_softmax.12, op_type: aten::log_softmax) which is func type
[2022-02-28 13:29:56] INFO (nni.compression.pytorch.speedup.compressor/MainThread) speedup done
TorchModel(
(conv1): Conv2d(1, 3, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(3, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=256, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 73-74
Roughly test the inference speed of the model after speedup.
.. GENERATED FROM PYTHON SOURCE LINES 74-78
.. code-block:: default
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Speedup Model - Elapsed Time : ', time.time() - start)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Speedup Model - Elapsed Time : 0.003432035446166992
.. GENERATED FROM PYTHON SOURCE LINES 79-247
For the combined usage of ``Pruner`` mask generation and ``ModelSpeedup``,
please refer to `Pruning Quick Start <./pruning_quick_start_mnist.html>`__.
NOTE: The current implementation supports PyTorch 1.3.1 or newer.
Limitations
-----------
For PyTorch, we can only replace modules; if functions in ``forward`` need to be replaced,
our current implementation does not work. One workaround is to make the function a PyTorch module.
If you want to speed up your own model that is not supported by the current implementation,
you need to implement the replacement function for module replacement; contributions are welcome.
Speedup Results of Examples
---------------------------
The code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.
These results are tested on the `legacy pruning framework <../comporession/pruning_legacy>`__; new results are coming soon.
slim pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01197
- 0.005107
* - 2
- 0.02019
- 0.008769
* - 4
- 0.02733
- 0.014809
* - 8
- 0.04310
- 0.027441
* - 16
- 0.07731
- 0.05008
* - 32
- 0.14464
- 0.10027
fpgm pruner example
^^^^^^^^^^^^^^^^^^^
on CPU,
input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
the variance is too large
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01383
- 0.01839
* - 2
- 0.01167
- 0.003558
* - 4
- 0.01636
- 0.01088
* - 40
- 0.14412
- 0.08268
* - 40
- 1.29385
- 0.14408
* - 40
- 0.41035
- 0.46162
* - 400
- 6.29020
- 5.82143
l1filter pruner example
^^^^^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01026
- 0.003677
* - 2
- 0.01657
- 0.008161
* - 4
- 0.02458
- 0.020018
* - 8
- 0.03498
- 0.025504
* - 16
- 0.06757
- 0.047523
* - 32
- 0.10487
- 0.086442
APoZ pruner example
^^^^^^^^^^^^^^^^^^^
on one V100 GPU,
input tensor: ``torch.randn(64, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - Times
- Mask Latency
- Speedup Latency
* - 1
- 0.01389
- 0.004208
* - 2
- 0.01628
- 0.008310
* - 4
- 0.02521
- 0.014008
* - 8
- 0.03386
- 0.023923
* - 16
- 0.06042
- 0.046183
* - 32
- 0.12421
- 0.087113
SimulatedAnnealing pruner example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In this experiment, we use the SimulatedAnnealing pruner to prune ResNet-18 on the CIFAR-10 dataset.
We measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.
The latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.
.. image:: ../../img/SA_latency_accuracy.png
User configuration for ModelSpeedup
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**PyTorch**
.. autoclass:: nni.compression.pytorch.ModelSpeedup
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 8.409 seconds)
.. _sphx_glr_download_tutorials_pruning_speed_up.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: pruning_speed_up.py <pruning_speed_up.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: pruning_speed_up.ipynb <pruning_speed_up.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Quantization Quickstart\n\nQuantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.\n\nIn NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.\nHere we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparation\n\nIn this tutorial, we use a simple model and pre-train on MNIST dataset.\nIf you are familiar with defining a model and training in pytorch, you can skip directly to `Quantizing Model`_.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nimport torch.nn.functional as F\nfrom torch.optim import SGD\n\nfrom scripts.compression_mnist_model import TorchModel, trainer, evaluator, device\n\n# define the model\nmodel = TorchModel().to(device)\n\n# define the optimizer and criterion for pre-training\n\noptimizer = SGD(model.parameters(), 1e-2)\ncriterion = F.nll_loss\n\n# pre-train and evaluate the model on MNIST dataset\nfor epoch in range(3):\n trainer(model, optimizer, criterion)\n evaluator(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quantizing Model\n\nInitialize a `config_list`.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"config_list = [{\n 'quant_types': ['input', 'weight'],\n 'quant_bits': {'input': 8, 'weight': 8},\n 'op_names': ['conv1']\n}, {\n 'quant_types': ['output'],\n 'quant_bits': {'output': 8},\n 'op_names': ['relu1']\n}, {\n 'quant_types': ['input', 'weight'],\n 'quant_bits': {'input': 8, 'weight': 8},\n 'op_names': ['conv2']\n}, {\n 'quant_types': ['output'],\n 'quant_bits': {'output': 8},\n 'op_names': ['relu2']\n}]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"finetuning the model by using QAT\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer\ndummy_input = torch.rand(32, 1, 28, 28).to(device)\nquantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)\nquantizer.compress()\nfor epoch in range(3):\n trainer(model, optimizer, criterion)\n evaluator(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"export model and get calibration_config\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model_path = \"./log/mnist_model.pth\"\ncalibration_path = \"./log/mnist_calibration.pth\"\ncalibration_config = quantizer.export_model(model_path, calibration_path)\n\nprint(\"calibration_config: \", calibration_config)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
\ No newline at end of file
"""
Quantization Quickstart
=======================
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
Here we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.
"""
# %%
# Preparation
# -----------
#
# In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
# If you are familiar with defining a model and training it in PyTorch, you can skip directly to `Quantizing Model`_.
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
# %%
# Quantizing Model
# ----------------
#
# Initialize a `config_list`.
config_list = [{
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv1']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu1']
}, {
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv2']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu2']
}]
# %%
# Fine-tune the model with QAT.
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
dummy_input = torch.rand(32, 1, 28, 28).to(device)
quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
quantizer.compress()
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
# %%
# Export the model and get the calibration_config.
model_path = "./log/mnist_model.pth"
calibration_path = "./log/mnist_calibration.pth"
calibration_config = quantizer.export_model(model_path, calibration_path)
print("calibration_config: ", calibration_config)
bcaf7880c66acfb20f3e5425730e21de
\ No newline at end of file
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/quantization_quick_start_mnist.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_quantization_quick_start_mnist.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_quantization_quick_start_mnist.py:
Quantization Quickstart
=======================
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
Here we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.
.. GENERATED FROM PYTHON SOURCE LINES 12-17
Preparation
-----------
In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
If you are familiar with defining a model and training it in PyTorch, you can skip directly to `Quantizing Model`_.
.. GENERATED FROM PYTHON SOURCE LINES 17-37
.. code-block:: default
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 0.4891, Accuracy: 8504/10000 (85%)
Average test loss: 0.2644, Accuracy: 9179/10000 (92%)
Average test loss: 0.1953, Accuracy: 9414/10000 (94%)
.. GENERATED FROM PYTHON SOURCE LINES 38-42
Quantizing Model
----------------
Initialize a `config_list`.
.. GENERATED FROM PYTHON SOURCE LINES 42-61
.. code-block:: default
config_list = [{
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv1']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu1']
}, {
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv2']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu2']
}]
.. GENERATED FROM PYTHON SOURCE LINES 62-63
Fine-tune the model with QAT.
.. GENERATED FROM PYTHON SOURCE LINES 63-71
.. code-block:: default
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
dummy_input = torch.rand(32, 1, 28, 28).to(device)
quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
quantizer.compress()
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 0.1421, Accuracy: 9567/10000 (96%)
Average test loss: 0.1180, Accuracy: 9621/10000 (96%)
Average test loss: 0.1119, Accuracy: 9649/10000 (96%)
.. GENERATED FROM PYTHON SOURCE LINES 72-73
Export the model and get the calibration_config.
.. GENERATED FROM PYTHON SOURCE LINES 73-78
.. code-block:: default
model_path = "./log/mnist_model.pth"
calibration_path = "./log/mnist_calibration.pth"
calibration_config = quantizer.export_model(model_path, calibration_path)
print("calibration_config: ", calibration_config)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
calibration_config: {'conv1': {'weight_bits': 8, 'weight_scale': tensor([0.0034], device='cuda:0'), 'weight_zero_point': tensor([71.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': -0.4242129623889923, 'tracked_max_input': 2.821486711502075}, 'conv2': {'weight_bits': 8, 'weight_scale': tensor([0.0020], device='cuda:0'), 'weight_zero_point': tensor([112.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 13.904684066772461}}
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 1 minutes 25.558 seconds)
.. _sphx_glr_download_tutorials_quantization_quick_start_mnist.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: quantization_quick_start_mnist.py <quantization_quick_start_mnist.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: quantization_quick_start_mnist.ipynb <quantization_quick_start_mnist.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_