Unverified Commit db3130d7 authored by J-shang, committed by GitHub

[Doc] Compression (#4574)

parent cef9babd
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Speed Up Model with Calibration Config\n\n\n## Introduction\n\nDeep learning network has been computational intensive and memory intensive \nwhich increases the difficulty of deploying deep neural network model. Quantization is a \nfundamental technology which is widely used to reduce memory footprint and speed up inference \nprocess. Many frameworks begin to support quantization, but few of them support mixed precision \nquantization and get real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__\\, only support simulated mixed precision quantization which will \nnot speed up the inference process. To get real speedup of mixed precision quantization and \nhelp people get the real feedback from hardware, we design a general framework with simple interface to allow NNI quantization algorithms to connect different \nDL model optimization backends (e.g., TensorRT, NNFusion), which gives users an end-to-end experience that after quantizing their model \nwith quantization algorithms, the quantized model can be directly speeded up with the connected optimization backend. NNI connects \nTensorRT at this stage, and will support more backends in the future.\n\n\n## Design and Implementation\n\nTo support speeding up mixed precision quantization, we divide framework into two part, frontend and backend. \nFrontend could be popular training frameworks such as PyTorch, TensorFlow etc. Backend could be inference \nframework for different hardwares, such as TensorRT. At present, we support PyTorch as frontend and \nTensorRT as backend. To convert PyTorch model to TensorRT engine, we leverage onnx as intermediate graph \nrepresentation. In this way, we convert PyTorch model to onnx model, then TensorRT parse onnx \nmodel to generate inference engine. \n\n\nQuantization aware training combines NNI quantization algorithm 'QAT' and NNI quantization speedup tool.\nUsers should set config to train quantized model using QAT algorithm(please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__\\ ).\nAfter quantization aware training, users can get new config with calibration parameters and model with quantized weight. By passing new config and model to quantization speedup tool, users can get real mixed precision speedup engine to do inference.\n\n\nAfter getting mixed precision engine, users can do inference with input data.\n\n\nNote\n\n\n* Recommend using \"cpu\"(host) as data device(for both inference data and calibration data) since data should be on host initially and it will be transposed to device before inference. If data type is not \"cpu\"(host), this tool will transpose it to \"cpu\" which may increases unnecessary overhead.\n* User can also do post-training quantization leveraging TensorRT directly(need to provide calibration dataset).\n* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in the following release.\n\n\n## Prerequisite\nCUDA version >= 11.0\n\nTensorRT version >= 7.2\n\nNote\n\n* If you haven't installed TensorRT before or use the old version, please refer to `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__\\ \n\n## Usage\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nimport torch.nn.functional as F\nfrom torch.optim import SGD\nfrom scripts.compression_mnist_model import TorchModel, device, trainer, evaluator, test_trt\n\nconfig_list = [{\n 'quant_types': ['input', 'weight'],\n 'quant_bits': {'input': 8, 'weight': 8},\n 'op_names': ['conv1']\n}, {\n 'quant_types': ['output'],\n 'quant_bits': {'output': 8},\n 'op_names': ['relu1']\n}, {\n 'quant_types': ['input', 'weight'],\n 'quant_bits': {'input': 8, 'weight': 8},\n 'op_names': ['conv2']\n}, {\n 'quant_types': ['output'],\n 'quant_bits': {'output': 8},\n 'op_names': ['relu2']\n}]\n\nmodel = TorchModel().to(device)\noptimizer = SGD(model.parameters(), lr=0.01, momentum=0.5)\ncriterion = F.nll_loss\ndummy_input = torch.rand(32, 1, 28,28).to(device)\n\nfrom nni.algorithms.compression.pytorch.quantization import QAT_Quantizer\nquantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)\nquantizer.compress()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"finetuning the model by using QAT\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for epoch in range(3):\n trainer(model, optimizer, criterion)\n evaluator(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"export model and get calibration_config\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model_path = \"./log/mnist_model.pth\"\ncalibration_path = \"./log/mnist_calibration.pth\"\ncalibration_config = quantizer.export_model(model_path, calibration_path)\n\nprint(\"calibration_config: \", calibration_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"build tensorRT engine to make a real speed up\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT\n# input_shape = (32, 1, 28, 28)\n# engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)\n# engine.compress()\n# test_trt(engine)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that NNI also supports post-training quantization directly, please refer to complete examples for detail.\n\nFor complete examples please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.\n\nFor more parameters about the class 'TensorRTModelSpeedUp', you can refer to `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\\.\n\n### Mnist test\n\non one GTX2080 GPU,\ninput tensor: ``torch.randn(128, 1, 28, 28)``\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - quantization strategy\n - Latency\n - accuracy\n * - all in 32bit\n - 0.001199961\n - 96%\n * - mixed precision(average bit 20.4)\n - 0.000753688\n - 96%\n * - all in 8bit\n - 0.000229869\n - 93.7%\n\n### Cifar10 resnet18 test (train one epoch)\n\non one GTX2080 GPU,\ninput tensor: ``torch.randn(128, 3, 32, 32)``\n\n.. list-table::\n :header-rows: 1\n :widths: auto\n\n * - quantization strategy\n - Latency\n - accuracy\n * - all in 32bit\n - 0.003286268\n - 54.21%\n * - mixed precision(average bit 11.55)\n - 0.001358022\n - 54.78%\n * - all in 8bit\n - 0.000859139\n - 52.81%\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
\ No newline at end of file
Speed up Mixed Precision Quantization Model (experimental)
===========================================================
Introduction
...@@ -56,87 +57,114 @@ Note
Usage
-----
quantization aware training:

.. code-block:: python

   # arrange bit config for QAT algorithm
   configure_list = [{
       'quant_types': ['weight', 'output'],
       'quant_bits': {'weight':8, 'output':8},
       'op_names': ['conv1']
       }, {
       'quant_types': ['output'],
       'quant_bits': {'output':8},
       'op_names': ['relu1']
       }
   ]

   quantizer = QAT_Quantizer(model, configure_list, optimizer)
   quantizer.compress()
   calibration_config = quantizer.export_model(model_path, calibration_path)

   engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=batch_size)
   # build tensorrt inference engine
   engine.compress()
   # data should be pytorch tensor
   output, time = engine.inference(data)

Note that NNI also supports post-training quantization directly, please refer to complete examples for detail.

For complete examples please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.

For more parameters about the class 'TensorRTModelSpeedUp', you can refer to `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__\.

Mnist test
^^^^^^^^^^

on one GTX2080 GPU,
input tensor: ``torch.randn(128, 1, 28, 28)``

.. list-table::
   :header-rows: 1
   :widths: auto

   * - quantization strategy
     - Latency
     - accuracy
   * - all in 32bit
     - 0.001199961
     - 96%
   * - mixed precision(average bit 20.4)
     - 0.000753688
     - 96%
   * - all in 8bit
     - 0.000229869
     - 93.7%

Cifar10 resnet18 test(train one epoch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

on one GTX2080 GPU,
input tensor: ``torch.randn(128, 3, 32, 32)``

.. list-table::
   :header-rows: 1
   :widths: auto

   * - quantization strategy
     - Latency
     - accuracy
   * - all in 32bit
     - 0.003286268
     - 54.21%
   * - mixed precision(average bit 11.55)
     - 0.001358022
     - 54.78%
   * - all in 8bit
     - 0.000859139
     - 52.81%
\ No newline at end of file
07fe95336a0d7edb8924dc24b609b361
\ No newline at end of file
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/quantization_speed_up.py"
.. LINE NUMBERS ARE GIVEN BELOW.
.. only:: html
.. note::
:class: sphx-glr-download-link-note
Click :ref:`here <sphx_glr_download_tutorials_quantization_speed_up.py>`
to download the full example code
.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_quantization_speed_up.py:
Speed Up Model with Calibration Config
======================================
Introduction
------------
Deep learning networks are computationally intensive and memory intensive,
which increases the difficulty of deploying deep neural network models. Quantization is a
fundamental technique that is widely used to reduce the memory footprint and speed up the inference
process. Many frameworks have begun to support quantization, but few of them support mixed-precision
quantization and achieve real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__ only support simulated mixed-precision quantization, which does
not speed up the inference process. To get real speedup from mixed-precision quantization and
help people get real feedback from hardware, we designed a general framework with a simple interface that allows NNI quantization algorithms to connect to different
DL model optimization backends (e.g., TensorRT, NNFusion). This gives users an end-to-end experience: after quantizing their model
with a quantization algorithm, the quantized model can be directly sped up with the connected optimization backend. NNI connects to
TensorRT at this stage, and will support more backends in the future.
Design and Implementation
-------------------------
To support speeding up mixed-precision quantization, we divide the framework into two parts: frontend and backend.
The frontend could be a popular training framework such as PyTorch or TensorFlow, and the backend could be an inference
framework for different hardware, such as TensorRT. At present, we support PyTorch as the frontend and
TensorRT as the backend. To convert a PyTorch model to a TensorRT engine, we leverage ONNX as the intermediate graph
representation: we convert the PyTorch model to an ONNX model, and then TensorRT parses the ONNX
model to generate the inference engine.
Quantization-aware training combines the NNI quantization algorithm 'QAT' and the NNI quantization speedup tool.
Users should set the config to train a quantized model using the QAT algorithm (please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__).
After quantization-aware training, users get a new config with calibration parameters and a model with quantized weights. By passing the new config and the model to the quantization speedup tool, users get a real mixed-precision speedup engine to run inference.
After getting the mixed-precision engine, users can run inference with input data.
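A minimal sketch of this end-to-end flow is shown below (assuming the model has already been trained with the QAT quantizer, TensorRT is installed, and ``data`` is a host-side input tensor; ``model_path``, ``calibration_path``, ``input_shape`` and ``batch_size`` are user-chosen values):

.. code-block:: python

   from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT

   calibration_config = quantizer.export_model(model_path, calibration_path)
   engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=batch_size)
   engine.compress()  # build the TensorRT inference engine
   output, time = engine.inference(data)  # returns the output and the elapsed time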
Note
* We recommend using "cpu" (host) as the data device (for both inference data and calibration data), since the data should be on the host initially and it will be transferred to the device before inference. If the data device is not "cpu" (host), this tool will transfer it to "cpu", which may introduce unnecessary overhead.
* Users can also do post-training quantization by leveraging TensorRT directly (a calibration dataset needs to be provided).
* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in following releases.
Prerequisite
------------
CUDA version >= 11.0
TensorRT version >= 7.2
Note
* If you haven't installed TensorRT yet, or you are using an old version, please refer to the `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__
Usage
-----
.. GENERATED FROM PYTHON SOURCE LINES 64-96
.. code-block:: default
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, device, trainer, evaluator, test_trt
config_list = [{
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv1']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu1']
}, {
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv2']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu2']
}]
model = TorchModel().to(device)
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = F.nll_loss
dummy_input = torch.rand(32, 1, 28,28).to(device)
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
quantizer.compress()
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
[2022-02-21 18:53:07] WARNING (nni.algorithms.compression.pytorch.quantization.qat_quantizer/MainThread) op_names ['relu1'] not found in model
[2022-02-21 18:53:07] WARNING (nni.algorithms.compression.pytorch.quantization.qat_quantizer/MainThread) op_names ['relu2'] not found in model
TorchModel(
(conv1): QuantizerModuleWrapper(
(module): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
)
(conv2): QuantizerModuleWrapper(
(module): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
)
(fc1): Linear(in_features=256, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
.. GENERATED FROM PYTHON SOURCE LINES 97-98
fine-tune the model using QAT
.. GENERATED FROM PYTHON SOURCE LINES 98-102
.. code-block:: default
for epoch in range(3):
trainer(model, optimizer, criterion)
evaluator(model)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Average test loss: 0.2524, Accuracy: 9209/10000 (92%)
Average test loss: 0.1711, Accuracy: 9461/10000 (95%)
Average test loss: 0.1037, Accuracy: 9690/10000 (97%)
.. GENERATED FROM PYTHON SOURCE LINES 103-104
export model and get calibration_config
.. GENERATED FROM PYTHON SOURCE LINES 104-110
.. code-block:: default
model_path = "./log/mnist_model.pth"
calibration_path = "./log/mnist_calibration.pth"
calibration_config = quantizer.export_model(model_path, calibration_path)
print("calibration_config: ", calibration_config)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
[2022-02-21 18:53:54] INFO (nni.compression.pytorch.compressor/MainThread) Model state_dict saved to ./log/mnist_model.pth
[2022-02-21 18:53:54] INFO (nni.compression.pytorch.compressor/MainThread) Mask dict saved to ./log/mnist_calibration.pth
calibration_config: {'conv1': {'weight_bits': 8, 'weight_scale': tensor([0.0026], device='cuda:0'), 'weight_zero_point': tensor([103.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': -0.4242129623889923, 'tracked_max_input': 2.821486711502075}, 'conv2': {'weight_bits': 8, 'weight_scale': tensor([0.0019], device='cuda:0'), 'weight_zero_point': tensor([116.], device='cuda:0'), 'input_bits': 8, 'tracked_min_input': 0.0, 'tracked_max_input': 10.175512313842773}}
.. GENERATED FROM PYTHON SOURCE LINES 111-112
build a TensorRT engine to achieve real speedup
.. GENERATED FROM PYTHON SOURCE LINES 112-119
.. code-block:: default
# from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
# input_shape = (32, 1, 28, 28)
# engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)
# engine.compress()
# test_trt(engine)
.. GENERATED FROM PYTHON SOURCE LINES 120-171
Note that NNI also supports post-training quantization directly; for complete examples please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.
For more parameters of the class ``ModelSpeedupTensorRT``, you can refer to the `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__.
Mnist test
^^^^^^^^^^
on one GTX2080 GPU,
input tensor: ``torch.randn(128, 1, 28, 28)``
.. list-table::
:header-rows: 1
:widths: auto
* - quantization strategy
- Latency
- accuracy
* - all in 32bit
- 0.001199961
- 96%
* - mixed precision(average bit 20.4)
- 0.000753688
- 96%
* - all in 8bit
- 0.000229869
- 93.7%
Cifar10 resnet18 test (train one epoch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
on one GTX2080 GPU,
input tensor: ``torch.randn(128, 3, 32, 32)``
.. list-table::
:header-rows: 1
:widths: auto
* - quantization strategy
- Latency
- accuracy
* - all in 32bit
- 0.003286268
- 54.21%
* - mixed precision(average bit 11.55)
- 0.001358022
- 54.78%
* - all in 8bit
- 0.000859139
- 52.81%
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 52.798 seconds)
.. _sphx_glr_download_tutorials_quantization_speed_up.py:
.. only :: html
.. container:: sphx-glr-footer
:class: sphx-glr-footer-example
.. container:: sphx-glr-download sphx-glr-download-python
:download:`Download Python source code: quantization_speed_up.py <quantization_speed_up.py>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
:download:`Download Jupyter notebook: quantization_speed_up.ipynb <quantization_speed_up.ipynb>`
.. only:: html
.. rst-class:: sphx-glr-signature
`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
...@@ -5,12 +5,20 @@
Computation times
=================
**00:08.409** total execution time for **tutorials** files:
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_pruning_speed_up.py` (``pruning_speed_up.py``) | 00:08.409 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_hello_nas.py` (``hello_nas.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_nasbench_as_dataset.py` (``nasbench_as_dataset.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_nni_experiment.py` (``nni_experiment.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_pruning_quick_start_mnist.py` (``pruning_quick_start_mnist.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_quick_start_mnist.py` (``quantization_quick_start_mnist.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
| :ref:`sphx_glr_tutorials_quantization_speed_up.py` (``quantization_speed_up.py``) | 00:00.000 | 0.0 MB |
+-----------------------------------------------------------------------------------------------------+-----------+--------+
"""
Pruning Quickstart
==================
Model pruning is a technique to reduce the model size and computation by reducing the model weight size or intermediate state size.
It usually follows one of the paths below:
#. Pre-training a model -> Pruning the model -> Fine-tuning the model
#. Pruning the model aware training -> Fine-tuning the model
#. Pruning the model -> Pre-training the compact model
NNI supports the above three modes and mainly focuses on the pruning stage.
Follow this tutorial for a quick look at how to use NNI to prune a model in a common practice.
"""
# %%
# Preparation
# -----------
#
# In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
# If you are familiar with defining a model and training in pytorch, you can skip directly to `Pruning Model`_.
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
    trainer(model, optimizer, criterion)
    evaluator(model)
# %%
# Pruning Model
# -------------
#
# Use L1NormPruner to prune the model and generate the masks.
# Usually, a pruner requires the original model and a ``config_list`` as its parameters.
# For details about how to write ``config_list``, please refer ...
#
# This `config_list` means all layers whose type is `Linear` or `Conv2d` will be pruned
# to a final sparsity ratio of 50%, except the layer named `fc3`, which is marked
# `exclude` and therefore will not be pruned.
config_list = [{
'sparsity_per_layer': 0.5,
'op_types': ['Linear', 'Conv2d']
}, {
'exclude': True,
'op_names': ['fc3']
}]
# %%
# Pruners usually require `model` and `config_list` as input arguments.
from nni.algorithms.compression.v2.pytorch.pruning import L1NormPruner
pruner = L1NormPruner(model, config_list)
# show the wrapped model structure
print(model)
# compress the model and generate the masks
_, masks = pruner.compress()
# show the masks sparsity
for name, mask in masks.items():
    print(name, ' sparsity: ', '{:.2}'.format(mask['weight'].sum() / mask['weight'].numel()))
# %%
# Speed up the original model with the masks; note that `ModelSpeedup` requires an unwrapped model.
# The model becomes smaller after speedup,
# and reaches a higher sparsity ratio because `ModelSpeedup` propagates the masks across layers.
# need to unwrap the model, if the model is wrapped before speed up
pruner._unwrap_model()
# speed up the model
from nni.compression.pytorch.speedup import ModelSpeedup
ModelSpeedup(model, torch.rand(3, 1, 28, 28).to(device), masks).speedup_model()
# %%
# the model has actually become smaller after speedup
print(model)
# %%
# Fine-tuning Compacted Model
# ---------------------------
# Note that if the model has been sped up, you need to re-initialize a new optimizer for fine-tuning,
# because speedup replaces the masked large layers with dense small ones.
optimizer = SGD(model.parameters(), 1e-2)
for epoch in range(3):
    trainer(model, optimizer, criterion)
"""
Speed Up Model with Mask
========================
Introduction
------------
Pruning algorithms usually use weight masks to simulate the real pruning. Masks can be used
to check the model performance under a specific pruning (or sparsity) setting, but there is no real speedup.
Since model speedup is the ultimate goal of model pruning, we provide a tool that converts
a model into a smaller one based on user-provided masks (the masks come from the
pruning algorithms).
There are two types of pruning. One is fine-grained pruning, which does not change the shape of weights
or input/output tensors, so a sparse kernel is required to speed up a fine-grained pruned layer.
The other is coarse-grained pruning (e.g., channels), where the shapes of weights and input/output tensors usually change.
To speed up this kind of pruning, there is no need to use a sparse kernel; the pruned layer can simply be replaced with a smaller one.
Since the support for sparse kernels in the community is limited,
we currently only support the speedup of coarse-grained pruning and leave the support for fine-grained pruning to the future.
Design and Implementation
-------------------------
To speed up a model, the pruned layers should be replaced, either with a smaller layer for a coarse-grained mask,
or with a sparse kernel for a fine-grained mask. A coarse-grained mask usually changes the shape of weights or input/output tensors,
thus we should do shape inference to check whether other unpruned layers should also be replaced due to the shape change.
Therefore, in our design, there are two main steps: first, do shape inference to find out all the modules that should be replaced;
second, replace the modules.
The first step requires the topology (i.e., connections) of the model; we use ``jit.trace`` to obtain the model graph for PyTorch.
The new shape of each module is inferred automatically by NNI; the unchanged parts of the outputs during forward and of the inputs during backward are recorded and used for the reduction.
For each type of module, we should prepare a function for module replacement.
The module replacement function returns a newly created module, which is smaller.
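For intuition, a coarse-grained replacement for a ``Conv2d`` whose output channels are partly masked could look roughly like the sketch below. This is only an illustrative example written for this explanation (``shrink_conv2d_out_channels`` is a hypothetical helper, not the replacement function NNI actually uses):

.. code-block:: python

   import torch
   from torch import nn

   def shrink_conv2d_out_channels(conv: nn.Conv2d, weight_mask: torch.Tensor) -> nn.Conv2d:
       # weight_mask has the same shape as conv.weight; keep output channels with any non-zero entry
       kept = weight_mask.reshape(weight_mask.shape[0], -1).sum(dim=1) != 0
       new_conv = nn.Conv2d(conv.in_channels, int(kept.sum()), conv.kernel_size,
                            stride=conv.stride, padding=conv.padding,
                            bias=conv.bias is not None)
       new_conv.weight.data = conv.weight.data[kept].clone()
       if conv.bias is not None:
           new_conv.bias.data = conv.bias.data[kept].clone()
       return new_conv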
Usage
-----
"""
# %%
# Generate a mask for the model first.
# We usually use an NNI pruner to generate the masks and then use ``ModelSpeedup`` to compact the model.
# ``ModelSpeedup`` is, however, a relatively independent tool, so you can also use it on its own.
import torch
from scripts.compression_mnist_model import TorchModel, device
model = TorchModel().to(device)
# masks = {layer_name: {'weight': weight_mask, 'bias': bias_mask}}
conv1_mask = torch.ones_like(model.conv1.weight.data)
# mask the first three output channels in conv1
conv1_mask[0: 3] = 0
masks = {'conv1': {'weight': conv1_mask}}
# %%
# Show the original model structure.
print(model)
# %%
# Roughly test the original model inference speed.
import time
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Original Model - Elapsed Time : ', time.time() - start)
# %%
# Speed up the model and show the model structure after speed up.
from nni.compression.pytorch import ModelSpeedup
ModelSpeedup(model, torch.rand(10, 1, 28, 28).to(device), masks).speedup_model()
print(model)
# %%
# Roughly test the inference speed of the model after speedup.
start = time.time()
model(torch.rand(128, 1, 28, 28).to(device))
print('Speedup Model - Elapsed Time : ', time.time() - start)
# %%
# For the combined usage of ``Pruner`` mask generation and ``ModelSpeedup``,
# please refer to `Pruning Quick Start <./pruning_quick_start_mnist.html>`__.
#
# NOTE: The current implementation supports PyTorch 1.3.1 or newer.
#
# Limitations
# -----------
#
# For PyTorch we can only replace modules; if functions in ``forward`` need to be replaced,
# our current implementation does not work. One workaround is to make the function a PyTorch module, as sketched below.
#
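# As an illustrative sketch of this workaround (the specific operation is only an example),
# an operation that would otherwise be called as a function inside ``forward`` can be
# registered as a module in ``__init__`` so that the speedup tool can see and handle it:
#
# .. code-block:: python
#
#    from torch import nn
#
#    class Block(nn.Module):
#        def __init__(self):
#            super().__init__()
#            self.conv = nn.Conv2d(3, 8, kernel_size=3)
#            self.act = nn.ReLU()   # a module instead of calling F.relu in forward
#
#        def forward(self, x):
#            return self.act(self.conv(x))
#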
# If you want to speed up your own model and it is not supported by the current implementation,
# you need to implement the replacement function for the corresponding module; contributions are welcome.
#
# Speedup Results of Examples
# ---------------------------
#
# The code of these experiments can be found :githublink:`here <examples/model_compress/pruning/speedup/model_speedup.py>`.
#
# These results were tested on the `legacy pruning framework <../compression/pruning_legacy>`__; new results are coming soon.
#
# slim pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01197
# - 0.005107
# * - 2
# - 0.02019
# - 0.008769
# * - 4
# - 0.02733
# - 0.014809
# * - 8
# - 0.04310
# - 0.027441
# * - 16
# - 0.07731
# - 0.05008
# * - 32
# - 0.14464
# - 0.10027
#
# fpgm pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on cpu,
# input tensor: ``torch.randn(64, 1, 28, 28)``\ ,
# (the measured latencies show a large variance)
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01383
# - 0.01839
# * - 2
# - 0.01167
# - 0.003558
# * - 4
# - 0.01636
# - 0.01088
# * - 40
# - 0.14412
# - 0.08268
# * - 40
# - 1.29385
# - 0.14408
# * - 40
# - 0.41035
# - 0.46162
# * - 400
# - 6.29020
# - 5.82143
#
# l1filter pruner example
# ^^^^^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01026
# - 0.003677
# * - 2
# - 0.01657
# - 0.008161
# * - 4
# - 0.02458
# - 0.020018
# * - 8
# - 0.03498
# - 0.025504
# * - 16
# - 0.06757
# - 0.047523
# * - 32
# - 0.10487
# - 0.086442
#
# APoZ pruner example
# ^^^^^^^^^^^^^^^^^^^
#
# on one V100 GPU,
# input tensor: ``torch.randn(64, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - Times
# - Mask Latency
# - Speedup Latency
# * - 1
# - 0.01389
# - 0.004208
# * - 2
# - 0.01628
# - 0.008310
# * - 4
# - 0.02521
# - 0.014008
# * - 8
# - 0.03386
# - 0.023923
# * - 16
# - 0.06042
# - 0.046183
# * - 32
# - 0.12421
# - 0.087113
#
# SimulatedAnnealing pruner example
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# In this experiment, we use SimulatedAnnealing pruner to prune the resnet18 on the cifar10 dataset.
# We measure the latencies and accuracies of the pruned model under different sparsity ratios, as shown in the following figure.
# The latency is measured on one V100 GPU and the input tensor is ``torch.randn(128, 3, 32, 32)``.
#
# .. image:: ../../img/SA_latency_accuracy.png
#
# User configuration for ModelSpeedup
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# **PyTorch**
#
# .. autoclass:: nni.compression.pytorch.ModelSpeedup
"""
Quantization Quickstart
=======================
Quantization reduces model size and speeds up inference time by reducing the number of bits required to represent weights or activations.
In NNI, both post-training quantization algorithms and quantization-aware training algorithms are supported.
Here we use `QAT_Quantizer` as an example to show the usage of quantization in NNI.
"""
# %%
# Preparation
# -----------
#
# In this tutorial, we use a simple model and pre-train it on the MNIST dataset.
# If you are familiar with defining a model and training in pytorch, you can skip directly to `Quantizing Model`_.
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, trainer, evaluator, device
# define the model
model = TorchModel().to(device)
# define the optimizer and criterion for pre-training
optimizer = SGD(model.parameters(), 1e-2)
criterion = F.nll_loss
# pre-train and evaluate the model on MNIST dataset
for epoch in range(3):
    trainer(model, optimizer, criterion)
    evaluator(model)
# %%
# Quantizing Model
# ----------------
#
# Initialize a `config_list`.
config_list = [{
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv1']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu1']
}, {
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv2']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu2']
}]
# %%
# fine-tune the model using QAT
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
dummy_input = torch.rand(32, 1, 28, 28).to(device)
quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
quantizer.compress()
for epoch in range(3):
    trainer(model, optimizer, criterion)
    evaluator(model)
# %%
# export model and get calibration_config
model_path = "./log/mnist_model.pth"
calibration_path = "./log/mnist_calibration.pth"
calibration_config = quantizer.export_model(model_path, calibration_path)
print("calibration_config: ", calibration_config)
"""
Speed Up Model with Calibration Config
======================================
Introduction
------------
Deep learning networks are computationally intensive and memory intensive,
which increases the difficulty of deploying deep neural network models. Quantization is a
fundamental technique that is widely used to reduce the memory footprint and speed up the inference
process. Many frameworks have begun to support quantization, but few of them support mixed-precision
quantization and achieve real speedup. Frameworks like `HAQ: Hardware-Aware Automated Quantization with Mixed Precision <https://arxiv.org/pdf/1811.08886.pdf>`__ only support simulated mixed-precision quantization, which does
not speed up the inference process. To get real speedup from mixed-precision quantization and
help people get real feedback from hardware, we designed a general framework with a simple interface that allows NNI quantization algorithms to connect to different
DL model optimization backends (e.g., TensorRT, NNFusion). This gives users an end-to-end experience: after quantizing their model
with a quantization algorithm, the quantized model can be directly sped up with the connected optimization backend. NNI connects to
TensorRT at this stage, and will support more backends in the future.
Design and Implementation
-------------------------
To support speeding up mixed-precision quantization, we divide the framework into two parts: frontend and backend.
The frontend could be a popular training framework such as PyTorch or TensorFlow, and the backend could be an inference
framework for different hardware, such as TensorRT. At present, we support PyTorch as the frontend and
TensorRT as the backend. To convert a PyTorch model to a TensorRT engine, we leverage ONNX as the intermediate graph
representation: we convert the PyTorch model to an ONNX model, and then TensorRT parses the ONNX
model to generate the inference engine.
Quantization-aware training combines the NNI quantization algorithm 'QAT' and the NNI quantization speedup tool.
Users should set the config to train a quantized model using the QAT algorithm (please refer to `NNI Quantization Algorithms <https://nni.readthedocs.io/en/stable/Compression/Quantizer.html>`__).
After quantization-aware training, users get a new config with calibration parameters and a model with quantized weights. By passing the new config and the model to the quantization speedup tool, users get a real mixed-precision speedup engine to run inference.
After getting the mixed-precision engine, users can run inference with input data.
Note
* We recommend using "cpu" (host) as the data device (for both inference data and calibration data), since the data should be on the host initially and it will be transferred to the device before inference. If the data device is not "cpu" (host), this tool will transfer it to "cpu", which may introduce unnecessary overhead.
* Users can also do post-training quantization by leveraging TensorRT directly (a calibration dataset needs to be provided).
* Not all op types are supported right now. At present, NNI supports Conv, Linear, Relu and MaxPool. More op types will be supported in following releases.
Prerequisite
------------
CUDA version >= 11.0
TensorRT version >= 7.2
Note
* If you haven't installed TensorRT yet, or you are using an old version, please refer to the `TensorRT Installation Guide <https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html>`__
Usage
-----
"""
# %%
import torch
import torch.nn.functional as F
from torch.optim import SGD
from scripts.compression_mnist_model import TorchModel, device, trainer, evaluator, test_trt
config_list = [{
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv1']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu1']
}, {
'quant_types': ['input', 'weight'],
'quant_bits': {'input': 8, 'weight': 8},
'op_names': ['conv2']
}, {
'quant_types': ['output'],
'quant_bits': {'output': 8},
'op_names': ['relu2']
}]
model = TorchModel().to(device)
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = F.nll_loss
dummy_input = torch.rand(32, 1, 28,28).to(device)
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input)
quantizer.compress()
# %%
# fine-tune the model using QAT
for epoch in range(3):
    trainer(model, optimizer, criterion)
    evaluator(model)
# %%
# export model and get calibration_config
model_path = "./log/mnist_model.pth"
calibration_path = "./log/mnist_calibration.pth"
calibration_config = quantizer.export_model(model_path, calibration_path)
print("calibration_config: ", calibration_config)
# %%
# build a TensorRT engine to achieve real speedup
# from nni.compression.pytorch.quantization_speedup import ModelSpeedupTensorRT
# input_shape = (32, 1, 28, 28)
# engine = ModelSpeedupTensorRT(model, input_shape, config=calibration_config, batchsize=32)
# engine.compress()
# test_trt(engine)
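#
# Once the engine has been built, inference can be run directly on host-side tensors.
# The lines below are only a sketch (kept commented out like the engine code above);
# ``engine.inference`` returns the output together with the elapsed inference time,
# and ``dummy_batch`` is just an illustrative host-side ("cpu") batch.

# dummy_batch = torch.rand(32, 1, 28, 28)
# output, latency = engine.inference(dummy_batch)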
# %%
# Note that NNI also supports post-training quantization directly; for complete examples please refer to :githublink:`the code <examples/model_compress/quantization/mixed_precision_speedup_mnist.py>`.
#
# For more parameters of the class ``ModelSpeedupTensorRT``, you can refer to the `Model Compression API Reference <https://nni.readthedocs.io/en/stable/Compression/CompressionReference.html#quantization-speedup>`__.
#
# Mnist test
# ^^^^^^^^^^
#
# on one GTX2080 GPU,
# input tensor: ``torch.randn(128, 1, 28, 28)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - quantization strategy
# - Latency
# - accuracy
# * - all in 32bit
# - 0.001199961
# - 96%
# * - mixed precision(average bit 20.4)
# - 0.000753688
# - 96%
# * - all in 8bit
# - 0.000229869
# - 93.7%
#
# Cifar10 resnet18 test (train one epoch)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# on one GTX2080 GPU,
# input tensor: ``torch.randn(128, 3, 32, 32)``
#
# .. list-table::
# :header-rows: 1
# :widths: auto
#
# * - quantization strategy
# - Latency
# - accuracy
# * - all in 32bit
# - 0.003286268
# - 54.21%
# * - mixed precision(average bit 11.55)
# - 0.001358022
# - 54.78%
# * - all in 8bit
# - 0.000859139
# - 52.81%
from pathlib import Path
root_path = Path(__file__).parent.parent
# define the model
import torch
from torch import nn
from torch.nn import functional as F
class TorchModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, 1)
        self.conv2 = nn.Conv2d(6, 16, 5, 1)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
# load data
from torchvision import datasets, transforms
train_loader = torch.utils.data.DataLoader(
datasets.MNIST(root_path / 'data', train=True, download=True, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])), batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(
datasets.MNIST(root_path / 'data', train=False, transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])), batch_size=1000, shuffle=True)
# define the trainer and evaluator
def trainer(model, optimizer, criterion):
    # training the model
    model.train()
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
def evaluator(model):
    # evaluating the model accuracy and average test loss
    model.eval()
    test_loss = 0
    correct = 0
    test_dataset_length = len(test_loader.dataset)
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # sum up batch loss
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            # get the index of the max log-probability
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= test_dataset_length
    accuracy = 100. * correct / test_dataset_length
    print('Average test loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(test_loss, correct, test_dataset_length, accuracy))
def test_trt(engine):
    # evaluate the accuracy and total inference time of a TensorRT engine on the test set
    test_loss = 0
    correct = 0
    time_elapsed = 0
    for data, target in test_loader:
        output, time = engine.inference(data)
        test_loss += F.nll_loss(output, target, reduction='sum').item()
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()
        time_elapsed += time
    test_loss /= len(test_loader.dataset)
    print('Loss: {} Accuracy: {}%'.format(
        test_loss, 100 * correct / len(test_loader.dataset)))
    print("Inference elapsed_time (whole dataset): {}s".format(time_elapsed))
...@@ -22,9 +22,74 @@ class ClipGrad(QuantGrad):
class BNNQuantizer(Quantizer):
r"""
Binarized Neural Networks, as defined in:
`Binarized Neural Networks: Training Deep Neural Networks with Weights and
Activations Constrained to +1 or -1 <https://arxiv.org/abs/1602.02830>`__\ ,
..
We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time.
At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass,
BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations,
which is expected to substantially improve power-efficiency.
Parameters
----------
model : torch.nn.Module
Model to be quantized.
config_list : List[Dict]
List of configurations for quantization. Supported keys for dict:
- quant_types : List[str]
Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
- quant_bits : Union[int, Dict[str, int]]
Bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
When the type is int, all quantization types share same bits length.
- op_types : List[str]
Types of nn.module you want to apply quantization, eg. 'Conv2d'.
- op_names : List[str]
Names of nn.module you want to apply quantization, eg. 'conv1'.
- exclude : bool
Set True then the layers setting by op_types and op_names will be excluded from quantization.
optimizer : torch.optim.Optimizer
Optimizer is required in `BNNQuantizer`, NNI will patch the optimizer and count the optimize step number.
Examples
--------
>>> from nni.algorithms.compression.pytorch.quantization import BNNQuantizer
>>> model = ...
>>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
>>> optimizer = ...
>>> quantizer = BNNQuantizer(model, config_list, optimizer)
>>> quantizer.compress()
>>> # Training Process...
For detailed example please refer to
:githublink:`examples/model_compress/quantization/BNN_quantizer_cifar10.py
<examples/model_compress/quantization/BNN_quantizer_cifar10.py>`.
Notes
-----
**Results**
We implemented one of the experiments in
`Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
<https://arxiv.org/abs/1602.02830>`__,
we quantized the **VGGNet** for CIFAR-10 in the paper. Our experiments results are as follows:
.. list-table::
:header-rows: 1
:widths: auto
* - Model
- Accuracy
* - VGGNet
- 86.93%
The experiments code can be found at
:githublink:`examples/model_compress/quantization/BNN_quantizer_cifar10.py
<examples/model_compress/quantization/BNN_quantizer_cifar10.py>`
""" """
def __init__(self, model, config_list, optimizer): def __init__(self, model, config_list, optimizer):
......
...@@ -13,9 +13,45 @@ logger = logging.getLogger(__name__) ...@@ -13,9 +13,45 @@ logger = logging.getLogger(__name__)
class DoReFaQuantizer(Quantizer): class DoReFaQuantizer(Quantizer):
"""Quantizer using the DoReFa scheme, as defined in: r"""
Zhou et al., DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients Quantizer using the DoReFa scheme, as defined in:
(https://arxiv.org/abs/1606.06160) `DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients <https://arxiv.org/abs/1606.06160>`__\ ,
authors Shuchang Zhou and Yuxin Wu provide an algorithm named DoReFa to quantize the weight, activation and gradients with training.
Parameters
----------
model : torch.nn.Module
Model to be quantized.
config_list : List[Dict]
List of configurations for quantization. Supported keys for dict:
- quant_types : List[str]
Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
- quant_bits : Union[int, Dict[str, int]]
Bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
When the type is int, all quantization types share same bits length.
- op_types : List[str]
Types of nn.module you want to apply quantization, eg. 'Conv2d'.
- op_names : List[str]
Names of nn.module you want to apply quantization, eg. 'conv1'.
- exclude : bool
Set True then the layers setting by op_types and op_names will be excluded from quantization.
optimizer : torch.optim.Optimizer
Optimizer is required in `DoReFaQuantizer`, NNI will patch the optimizer and count the optimize step number.
Examples
--------
>>> from nni.algorithms.compression.pytorch.quantization import DoReFaQuantizer
>>> model = ...
>>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
>>> optimizer = ...
>>> quantizer = DoReFaQuantizer(model, config_list, optimizer)
>>> quantizer.compress()
>>> # Training Process...
For detailed example please refer to
:githublink:`examples/model_compress/quantization/DoReFaQuantizer_torch_mnist.py
<examples/model_compress/quantization/DoReFaQuantizer_torch_mnist.py>`.
""" """
def __init__(self, model, config_list, optimizer): def __init__(self, model, config_list, optimizer):
......
...@@ -11,35 +11,56 @@ logger = logging.getLogger(__name__) ...@@ -11,35 +11,56 @@ logger = logging.getLogger(__name__)
class LsqQuantizer(Quantizer): class LsqQuantizer(Quantizer):
"""Quantizer defined in: r"""
Learned Step Size Quantization (ICLR 2020) Quantizer defined in: `LEARNED STEP SIZE QUANTIZATION <https://arxiv.org/pdf/1902.08153.pdf>`__,
https://arxiv.org/pdf/1902.08153.pdf authors Steven K. Esser and Jeffrey L. McKinstry provide an algorithm to train the scales with gradients.
..
The authors introduce a novel means to estimate and scale the task loss gradient at each weight and activation
layer's quantizer step size, such that it can be learned in conjunction with other network parameters.
Parameters
----------
model : torch.nn.Module
The model to be quantized.
config_list : List[Dict]
List of configurations for quantization. Supported keys for dict:
- quant_types : List[str]
Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
- quant_bits : Union[int, Dict[str, int]]
Bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
When the type is int, all quantization types share same bits length.
- op_types : List[str]
Types of nn.module you want to apply quantization, eg. 'Conv2d'.
- op_names : List[str]
Names of nn.module you want to apply quantization, eg. 'conv1'.
- exclude : bool
Set True then the layers setting by op_types and op_names will be excluded from quantization.
optimizer : torch.optim.Optimizer
Optimizer is required in `LsqQuantizer`, NNI will patch the optimizer and count the optimize step number.
dummy_input : Tuple[torch.Tensor]
Inputs to the model, which are used to get the graph of the module. The graph is used to find Conv-Bn patterns.
And then the batch normalization folding would be enabled. If dummy_input is not given,
the batch normalization folding would be disabled.
Examples
--------
>>> from nni.algorithms.compression.pytorch.quantization import LsqQuantizer
>>> model = ...
>>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
>>> optimizer = ...
>>> dummy_input = torch.rand(...)
>>> quantizer = LsqQuantizer(model, config_list, optimizer, dummy_input=dummy_input)
>>> quantizer.compress()
>>> # Training Process...
For detailed example please refer to
:githublink:`examples/model_compress/quantization/LSQ_torch_quantizer.py <examples/model_compress/quantization/LSQ_torch_quantizer.py>`.
""" """
def __init__(self, model, config_list, optimizer, dummy_input=None): def __init__(self, model, config_list, optimizer, dummy_input=None):
"""
Parameters
----------
model : torch.nn.Module
the model to be quantized
config_list : list of dict
list of configurations for quantization
supported keys for dict:
- quant_types : list of string
type of quantization you want to apply, currently support 'weight', 'input', 'output'
- quant_bits : int or dict of {str : int}
bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
when the type is int, all quantization types share same bits length
- quant_start_step : int
disable quantization until model are run by certain number of steps, this allows the network to enter a more stable
state where output quantization ranges do not exclude a significant fraction of values, default value is 0
- op_types : list of string
types of nn.module you want to apply quantization, eg. 'Conv2d'
- dummy_input : tuple of tensor
inputs to the model, which are used to get the graph of the module. The graph is used to find
Conv-Bn patterns. And then the batch normalization folding would be enabled. If dummy_input is not
given, the batch normalization folding would be disabled.
"""
assert isinstance(optimizer, torch.optim.Optimizer), "unrecognized optimizer type"
super().__init__(model, config_list, optimizer, dummy_input)
device = next(model.parameters()).device
...
...@@ -12,7 +12,32 @@ logger = logging.getLogger(__name__)
class NaiveQuantizer(Quantizer):
r"""
Quantize weight to 8 bits directly.
Parameters
----------
model : torch.nn.Module
Model to be quantized.
config_list : List[Dict]
List of configurations for quantization. Supported keys:
- quant_types : List[str]
Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
- quant_bits : Union[int, Dict[str, int]]
Bits length of quantization, key is the quantization type, value is the length, eg. {'weight': 8},
when the type is int, all quantization types share same bits length.
- op_types : List[str]
Types of nn.module you want to apply quantization, eg. 'Conv2d'.
- op_names : List[str]
Names of nn.module you want to apply quantization, eg. 'conv1'.
- exclude : bool
Set True then the layers setting by op_types and op_names will be excluded from quantization.
Examples
--------
>>> from nni.algorithms.compression.pytorch.quantization import NaiveQuantizer
>>> model = ...
>>> NaiveQuantizer(model).compress()
""" """
def __init__(self, model, config_list, optimizer=None): def __init__(self, model, config_list, optimizer=None):
......
...@@ -14,7 +14,12 @@ logger = logging.getLogger(__name__) ...@@ -14,7 +14,12 @@ logger = logging.getLogger(__name__)
class ObserverQuantizer(Quantizer): class ObserverQuantizer(Quantizer):
"""This quantizer uses observers to record weight/output statistics to get quantization information. r"""
Observer quantizer is a framework of post-training quantization.
It will insert observers into the place where the quantization will happen.
During quantization calibration, each observer will record all the tensors it 'sees'.
These tensors will be used to calculate the quantization statistics after calibration.
The whole process can be divided into three steps:
1. It will register observers to the place where quantization would happen (just like registering hooks).
@@ -23,6 +28,66 @@ class ObserverQuantizer(Quantizer):
Note that the observer type, tensor dtype and quantization qscheme are hard coded for now. Their customization
is under development and will be ready soon.
Parameters
----------
model : torch.nn.Module
Model to be quantized.
config_list : List[Dict]
List of configurations for quantization. Supported keys:
- quant_types : List[str]
Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
- quant_bits : Union[int, Dict[str, int]]
Bit length of quantization; the key is the quantization type and the value is the bit length, e.g. {'weight': 8}.
When an int is given, all quantization types share the same bit length.
- op_types : List[str]
Types of nn.module you want to apply quantization, eg. 'Conv2d'.
- op_names : List[str]
Names of nn.module you want to apply quantization, eg. 'conv1'.
- exclude : bool
If set to True, the layers specified by op_types and op_names will be excluded from quantization.
optimizer : torch.optim.Optimizer
Optimizer is optional in `ObserverQuantizer`.
Examples
--------
>>> from nni.algorithms.compression.pytorch.quantization import ObserverQuantizer
>>> model = ...
>>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
>>> quantizer = ObserverQuantizer(model, config_list)
>>> # define a calibration function
>>> def calibration(model, calib_loader):
>>> model.eval()
>>> with torch.no_grad():
>>> for data, _ in calib_loader:
>>> model(data)
>>> calibration(model, calib_loader)
>>> quantizer.compress()
For detailed example please refer to
:githublink:`examples/model_compress/quantization/observer_quantizer.py <examples/model_compress/quantization/observer_quantizer.py>`.
.. note::
This quantizer is still under development for now. Some quantizer settings are hard-coded:
- weight observer: per_tensor_symmetric, qint8
- output observer: per_tensor_affine, quint8, reduce_range=True
Other settings (such as quant_type and op_names) can be configured.
Notes
-----
**About the compress API**
Before the `compress` API is called, the model will only record tensors' statistics and no quantization process will be executed.
After the `compress` API is called, the model will NOT record tensors' statistics any more. The quantization scale and zero point will
be generated for each tensor and will be used to quantize each tensor during inference (we call it evaluation mode).
**About calibration**
Usually we pick about 100 training/evaluation examples for calibration. If you find the accuracy is a bit low, try
reducing the number of calibration examples.
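As a reference only, a calibration loader such as the `calib_loader` used in the example above can be built by
sub-sampling an existing dataset; the dataset name and the batch size below are illustrative placeholders, not part of the NNI API.
.. code-block:: python
import torch
from torch.utils.data import DataLoader, Subset
# randomly pick roughly 100 samples from the full training set for calibration
calib_indices = torch.randperm(len(train_dataset))[:100].tolist()
calib_loader = DataLoader(Subset(train_dataset, calib_indices), batch_size=10)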
""" """
def __init__(self, model, config_list, optimizer=None): def __init__(self, model, config_list, optimizer=None):
...
@@ -107,36 +107,150 @@ def update_ema(biased_ema, value, decay):
class QAT_Quantizer(Quantizer):
r"""
Quantizer defined in:
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf
Authors Benoit Jacob and Skirmantas Kligys provide an algorithm to quantize the model with training.
..
We propose an approach that simulates quantization effects in the forward pass of training.
Backpropagation still happens as usual, and all weights and biases are stored in floating point
so that they can be easily nudged by small amounts.
The forward propagation pass however simulates quantized inference as it will happen in the inference engine,
by implementing in floating-point arithmetic the rounding behavior of the quantization scheme:
* Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer,
the batch normalization parameters are “folded into” the weights before quantization.
* Activations are quantized at points where they would be during inference,
e.g. after the activation function is applied to a convolutional or fully connected layer’s output,
or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.
Parameters
----------
model : torch.nn.Module
Model to be quantized.
config_list : List[Dict]
List of configurations for quantization. Supported keys for dict:
- quant_types : List[str]
Type of quantization you want to apply, currently support 'weight', 'input', 'output'.
- quant_bits : Union[int, Dict[str, int]]
Bit length of quantization; the key is the quantization type and the value is the bit length, e.g. {'weight': 8}.
When an int is given, all quantization types share the same bit length.
- quant_start_step : int
Disable quantization until the model has been run for a certain number of steps. This allows the network to enter a more
stable state where output quantization ranges do not exclude a significant fraction of values. Default value is 0.
- op_types : List[str]
Types of nn.module you want to apply quantization, eg. 'Conv2d'.
- op_names : List[str]
Names of nn.module you want to apply quantization, eg. 'conv1'.
- exclude : bool
If set to True, the layers specified by op_types and op_names will be excluded from quantization.
optimizer : torch.optim.Optimizer
Optimizer is required in `QAT_Quantizer`; NNI will patch the optimizer and count the optimization steps.
dummy_input : Tuple[torch.Tensor]
Inputs to the model, which are used to get the graph of the module. The graph is used to find Conv-Bn patterns.
And then the batch normalization folding would be enabled. If dummy_input is not given,
the batch normalization folding would be disabled.
Examples
--------
>>> from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
>>> model = ...
>>> config_list = [{'quant_types': ['weight', 'input'], 'quant_bits': {'weight': 8, 'input': 8}, 'op_types': ['Conv2d']}]
>>> optimizer = ...
>>> dummy_input = torch.rand(...)
>>> quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input=dummy_input)
>>> quantizer.compress()
>>> # Training Process...
For detailed example please refer to
:githublink:`examples/model_compress/quantization/QAT_torch_quantizer.py <examples/model_compress/quantization/QAT_torch_quantizer.py>`.
Notes
-----
**Batch normalization folding**
Batch normalization folding is supported in QAT quantizer. It can be easily enabled by passing an argument `dummy_input` to
the quantizer, like:
.. code-block:: python
# assume your model takes an input of shape (1, 1, 28, 28)
# and dummy_input must be on the same device as the model
dummy_input = torch.randn(1, 1, 28, 28)
# pass the dummy_input to the quantizer
quantizer = QAT_Quantizer(model, config_list, dummy_input=dummy_input)
The quantizer will automatically detect Conv-BN patterns and simulate batch normalization folding process in the training
graph. Note that when the quantization aware training process is finished, the folded weight/bias would be restored after calling
`quantizer.export_model`.
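For orientation, the end of a quantization aware training run typically looks like the sketch below. The `export_model`
argument names (a path for the quantized state dict and a path for the calibration parameters) and its return value are
assumptions based on common NNI usage and may differ between versions, so treat them as illustrative.
.. code-block:: python
# after the QAT training loop finishes; both paths are placeholders
model_path = 'quantized_model.pth'
calibration_path = 'calibration_config.pth'
# exporting restores the folded weight/bias; the returned calibration config
# can later be passed to the quantization speedup tool
calibration_config = quantizer.export_model(model_path, calibration_path)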
**Quantization dtype and scheme customization**
Different backends on different devices use different quantization strategies (i.e. dtype (int or uint) and
scheme (per-tensor or per-channel and symmetric or affine)). QAT quantizer supports customization of mainstream dtypes and schemes.
There are two ways to set them. One way is setting them globally through a function named `set_quant_scheme_dtype` like:
.. code-block:: python
from nni.compression.pytorch.quantization.settings import set_quant_scheme_dtype
# This will set all the quantization of 'input' in 'per_tensor_affine' and 'uint' manner
set_quant_scheme_dtype('input', 'per_tensor_affine', 'uint')
# This will set all the quantization of 'output' in 'per_tensor_symmetric' and 'int' manner
set_quant_scheme_dtype('output', 'per_tensor_symmetric', 'int')
# This will set all the quantization of 'weight' in 'per_channel_symmetric' and 'int' manner
set_quant_scheme_dtype('weight', 'per_channel_symmetric', 'int')
The other way is more detailed. You can customize the dtype and scheme in each quantization config list like:
.. code-block:: python
config_list = [{
'quant_types': ['weight'],
'quant_bits': 8,
'op_types':['Conv2d', 'Linear'],
'quant_dtype': 'int',
'quant_scheme': 'per_channel_symmetric'
}, {
'quant_types': ['output'],
'quant_bits': 8,
'quant_start_step': 7000,
'op_types':['ReLU6'],
'quant_dtype': 'uint',
'quant_scheme': 'per_tensor_affine'
}]
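Either way, the customized dtype and scheme travel with the quantizer configuration; a minimal sketch that reuses the
placeholder `model`, `optimizer` and `dummy_input` names from the Examples section above:
.. code-block:: python
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
# the per-layer dtype/scheme settings in config_list are picked up by the quantizer
quantizer = QAT_Quantizer(model, config_list, optimizer, dummy_input=dummy_input)
quantizer.compress()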
**Multi-GPU training**
QAT quantizer natively supports multi-gpu training (DataParallel and DistributedDataParallel). Note that the quantizer
instantiation should happen before you wrap your model with DataParallel or DistributedDataParallel. For example:
.. code-block:: python
from torch.nn.parallel import DistributedDataParallel as DDP
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer
model = define_your_model()
quantizer = QAT_Quantizer(model, **other_params)  # <--- QAT_Quantizer instantiation
model = DDP(model)
for i in range(epochs):
train(model)
eval(model)
""" """
def __init__(self, model, config_list, optimizer, dummy_input=None): def __init__(self, model, config_list, optimizer, dummy_input=None):
"""
Parameters
----------
model : torch.nn.Module
the model to be quantized
config_list : list of dict
list of configurations for quantization
supported keys for dict:
- quant_types : list of string
type of quantization you want to apply, currently support 'weight', 'input', 'output'
- quant_bits : int or dict of {str : int}
bit length of quantization; the key is the quantization type and the value is the bit length, e.g. {'weight': 8};
when an int is given, all quantization types share the same bit length
- quant_start_step : int
disable quantization until the model has been run for a certain number of steps; this allows the network to enter
a more stable state where output quantization ranges do not exclude a significant fraction of values; default value is 0
- op_types : list of string
types of nn.module you want to apply quantization, eg. 'Conv2d'
- dummy_input : tuple of tensor
inputs to the model, which are used to get the graph of the module. The graph is used to find
Conv-Bn patterns. And then the batch normalization folding would be enabled. If dummy_input is not
given, the batch normalization folding would be disabled.
"""
assert isinstance(optimizer, torch.optim.Optimizer), "unrecognized optimizer type"
super().__init__(model, config_list, optimizer, dummy_input)
self.quant_grad = QATGrad.apply
...
@@ -160,9 +160,13 @@ class AMCTaskGenerator(TaskGenerator):
class AMCPruner(IterativePruner):
r"""
AMC pruner leverages reinforcement learning to provide the model compression policy.
According to the author, this learning-based compression policy outperforms conventional rule-based compression policy by having a higher compression ratio,
better preserving the accuracy and freeing human labor.
For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__.
It is suggested to set the same value for all `total_sparsity` entries in `config_list`.
AMC pruner will treat the first sparsity in `config_list` as the global sparsity.
@@ -216,6 +220,18 @@ class AMCPruner(IterativePruner):
target : str
'flops' or 'params'. Note that the sparsity in other pruners always refers to parameter sparsity, but in AMC you can choose FLOPs sparsity.
This parameter is used to explain what the sparsity setting in config_list refers to.
Examples
--------
>>> from nni.algorithms.compression.v2.pytorch.pruning import AMCPruner
>>> config_list = [{'op_types': ['Conv2d'], 'total_sparsity': 0.5, 'max_sparsity_per_layer': 0.8}]
>>> dummy_input = torch.rand(...).to(device)
>>> evaluator = ...
>>> finetuner = ...
>>> pruner = AMCPruner(400, model, config_list, dummy_input, evaluator, finetuner=finetuner)
>>> pruner.compress()
The full script can be found :githublink:`here <examples/model_compress/pruning/v2/amc_pruning_torch.py>`.
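The `evaluator` and `finetuner` placeholders in the example above are plain callables; their exact signatures are not
spelled out here, so the sketch below only illustrates the commonly used form (names such as `val_loader` and
`train_one_epoch` are hypothetical helpers).
.. code-block:: python
def evaluator(model: Module) -> float:
    # run the validation set and return a single scalar metric, e.g. top-1 accuracy
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for data, target in val_loader:
            pred = model(data).argmax(dim=1)
            correct += (pred == target).sum().item()
            total += target.numel()
    return correct / total
def finetuner(model: Module):
    # briefly fine-tune the pruned model so the next iteration starts from a recovered accuracy
    train_one_epoch(model)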
""" """
def __init__(self, total_episode: int, model: Module, config_list: List[Dict], dummy_input: Tensor, def __init__(self, total_episode: int, model: Module, config_list: List[Dict], dummy_input: Tensor,
......