Tutorial for Model Compression
==============================

.. contents::

In this tutorial, the `first section <#quick-start-to-compress-a-model>`__ quickly walks through the usage of model compression in NNI, and the `second section <#detailed-usage-guide>`__ explains the usage in more detail.

Quick Start to Compress a Model
-------------------------------

NNI provides very simple APIs for compressing a model. The compression includes pruning algorithms and quantization algorithms, and they share the same usage. Here we use the `slim pruner <../Compression/Pruner.rst#slim-pruner>`__ as an example to show the usage.

Write configuration
^^^^^^^^^^^^^^^^^^^

Write a configuration to specify the layers that you want to prune. The following configuration means pruning all the ``BatchNorm2d``\ s to sparsity 0.7 while keeping other layers unpruned.

.. code-block:: python

   configure_list = [{
       'sparsity': 0.7,
       'op_types': ['BatchNorm2d'],
   }]

The specification of the configuration can be found `here <#specification-of-config-list>`__. Note that different pruners may have their own defined fields in the configuration, for example ``start_epoch`` in the AGP pruner. Please refer to each pruner's `usage <./Pruner.rst>`__ for details, and adjust the configuration accordingly.

Choose a compression algorithm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Choose a pruner to prune your model. First instantiate the chosen pruner with your model and configuration as arguments, then invoke ``compress()`` to compress your model.

.. code-block:: python

   pruner = SlimPruner(model, configure_list)
   model = pruner.compress()

Then, you can train your model using a traditional training approach (e.g., SGD); pruning is applied transparently during training. Some pruners prune once at the beginning, and the following training can be seen as fine-tuning. Other pruners prune your model iteratively, adjusting the masks epoch by epoch during training.
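
As a rough sketch, the fine-tuning loop is just an ordinary PyTorch training loop; ``train_loader`` and the number of epochs below are placeholders you should replace with your own data loader and schedule:

.. code-block:: python

   import torch

   optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
   criterion = torch.nn.CrossEntropyLoss()

   for epoch in range(10):
       model.train()
       for data, target in train_loader:
           optimizer.zero_grad()
           loss = criterion(model(data), target)
           loss.backward()
           optimizer.step()  # pruning stays transparent; the masks keep taking effect on forward passes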

Export compression result
^^^^^^^^^^^^^^^^^^^^^^^^^

After training, you get the accuracy of the pruned model. You can export the model weights to a file, and the generated masks to a file as well. Exporting an ONNX model is also supported.

.. code-block:: python

   pruner.export_model(model_path='pruned_vgg19_cifar10.pth', mask_path='mask_vgg19_cifar10.pth')

Please refer to the :githublink:`MNIST example <examples/model_compress/pruning/naive_prune_torch.py>` for a quick start.

Speed up the model
^^^^^^^^^^^^^^^^^^

Masks do not provide a real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model as shown below. After invoking ``apply_compression_results`` on your model, your model becomes a smaller one with shorter inference latency.

.. code-block:: python

   from nni.compression.pytorch import apply_compression_results
   apply_compression_results(model, 'mask_vgg19_cifar10.pth')

Please refer to `here <ModelSpeedup.rst>`__ for a detailed description.

Detailed Usage Guide
--------------------

Example code for applying model compression to a user-defined model can be found below:

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
   pruner = LevelPruner(model, config_list)
   pruner.compress()

You can use other compression algorithms in the ``nni.compression`` package. The algorithms are implemented in both PyTorch and TensorFlow (with partial support for TensorFlow), under ``nni.compression.pytorch`` and ``nni.compression.tensorflow`` respectively. You can refer to `Pruner <./Pruner.rst>`__ and `Quantizer <./Quantizer.rst>`__ for detailed descriptions of the supported algorithms. If you want to use knowledge distillation, you can refer to `KDExample <../TrialExample/KDExample.rst>`__.
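
For reference, the PyTorch pruner used in the example above is imported from ``nni.algorithms.compression.pytorch.pruning``\ ; the quantizer import below follows the same layout but is an assumption on our part, so double-check it against the `Quantizer <./Quantizer.rst>`__ page:

.. code-block:: python

   # LevelPruner is imported this way in the example above; SlimPruner is assumed
   # to live in the same module.
   from nni.algorithms.compression.pytorch.pruning import LevelPruner, SlimPruner

   # Assumed location of the QAT quantizer; verify against the Quantizer documentation.
   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer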

A compression algorithm is first instantiated with a ``config_list`` passed in. The specification of this ``config_list`` will be described later.

The function call ``pruner.compress()`` modifies the user-defined model (in TensorFlow the model can be obtained with ``tf.get_default_graph()``\ , while in PyTorch the model is the defined model class) by inserting masks into it. Then, when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
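
For example, a plain forward pass after ``compress()`` already uses the masked weights (the input shape below is only an assumption for a CIFAR-10-style model; adjust it to your own network):

.. code-block:: python

   import torch

   dummy_input = torch.randn(1, 3, 32, 32)  # adjust to your model's expected input
   output = model(dummy_input)              # the inserted masks take effect here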

Note that ``pruner.compress`` simply adds masks on the model weights; it does not include fine-tuning logic. If users want to fine-tune the compressed model, they need to write the fine-tuning logic themselves after ``pruner.compress``.

Specification of ``config_list``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, specify different ratios for different types of operations, exclude certain types of operations, or compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a Python ``list`` object, where each element is a ``dict`` object.

The ``dict``\ s in the ``list`` are applied one by one, that is, the configurations in a latter ``dict`` will overwrite the configurations in former ones for the operations that are within the scope of both of them.

There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:


* **op_types**\ : This specifies the types of operations to be compressed. ``'default'`` means following the algorithm's default setting.
* **op_names**\ : This specifies, by name, the operations to be compressed. If this field is omitted, operations will not be filtered by it.
* **exclude**\ : Default is False. If this field is True, the operations with the specified types and names will be excluded from the compression.

Some other keys are specific to particular algorithms; users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.
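
For example, an iterative pruner such as AGP accepts schedule-related keys in addition to the common ones. The snippet below is only an illustration; treat the exact key names and values as defined by the AGP pruner's own documentation:

.. code-block:: python

   # Illustrative AGP-style configuration; verify the keys against the AGP pruner docs.
   config_list = [{
       'initial_sparsity': 0.0,  # sparsity when the schedule starts
       'final_sparsity': 0.8,    # target sparsity at the end of the schedule
       'start_epoch': 0,         # epoch at which pruning begins
       'end_epoch': 10,          # epoch at which the target sparsity is reached
       'frequency': 1,           # update the masks every N epochs
       'op_types': ['Conv2d']
   }]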

A simple example of configuration is shown below:

.. code-block:: python

   [
       {
           'sparsity': 0.8,
           'op_types': ['default']
       },
       {
           'sparsity': 0.6,
           'op_names': ['op_name1', 'op_name2']
       },
       {
           'exclude': True,
           'op_names': ['op_name3']
       }
   ]

It means following the algorithm's default setting for compressed operations with sparsity 0.8, using sparsity 0.6 for ``op_name1`` and ``op_name2``\ , and not compressing ``op_name3``.

Quantization specific keys
^^^^^^^^^^^^^^^^^^^^^^^^^^

Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.


* **quant_types** : list of strings.

Types of quantization you want to apply; currently 'weight', 'input', and 'output' are supported. 'weight' means applying the quantization operation to the weight parameters of modules. 'input' means applying the quantization operation to the input of the module's forward method. 'output' means applying the quantization operation to the output of the module's forward method, which is often called 'activation' in some papers.


* **quant_bits** : int or dict of {str : int}

Bit length of quantization; the key is the quantization type and the value is the quantization bit length, e.g.,

.. code-block:: python

   {
       'quant_bits': {
           'weight': 8,
           'output': 4,
       },
   }

When the value is of int type, all quantization types share the same bit length, e.g.,

.. code-block:: python

   {
       'quant_bits': 8,  # both weight and output quantization use 8 bits
   }

The following example shows a more complete ``config_list``\ ; it uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.

.. code-block:: python

   configure_list = [{
           'quant_types': ['weight'],
           'quant_bits': 8,
           'op_names': ['conv1']
       }, {
           'quant_types': ['weight'],
           'quant_bits': 4,
           'quant_start_step': 0,
           'op_names': ['conv2']
       }, {
           'quant_types': ['weight'],
           'quant_bits': 3,
           'op_names': ['fc1']
       }, {
           'quant_types': ['weight'],
           'quant_bits': 2,
           'op_names': ['fc2']
       }]

In this example, ``op_names`` specifies the layer names, and the four layers will be quantized with different ``quant_bits``.

APIs for Updating Fine Tuning Status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some compression algorithms use epochs to control the progress of compression (e.g., `AGP <../Compression/Pruner.rst#agp-pruner>`__\ ), and some algorithms need to do something after every minibatch. Therefore, we provide two more APIs for users to invoke: ``pruner.update_epoch(epoch)`` and ``pruner.step()``.

``update_epoch`` should be invoked in every epoch, while ``step`` should be invoked after each minibatch. Note that most algorithms do not require calling the two APIs. Please refer to each algorithm's documentation for details. For the algorithms that do not need them, calling them is allowed but has no effect.
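
A minimal placement sketch is shown below; ``train_one_epoch`` is a placeholder for your own training function, and the per-minibatch ``pruner.step()`` call would go inside it, right after ``optimizer.step()``:

.. code-block:: python

   for epoch in range(num_epochs):
       pruner.update_epoch(epoch)  # advance epoch-based compression schedules (e.g. AGP)
       train_one_epoch(model, train_loader, optimizer)  # call pruner.step() after each minibatch inside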

Export Pruned Model
^^^^^^^^^^^^^^^^^^^^

If you are pruning your model, you can easily export the pruned model using the following API. The ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. In the exported ``model.pth``\ , the masked weights are zero.

.. code-block:: python

   pruner.export_model(model_path='model.pth')

The ``mask_dict`` and the pruned model in ``onnx`` format (\ ``input_shape`` needs to be specified) can also be exported like this:

.. code-block:: python

   pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])

Export Quantized Model
^^^^^^^^^^^^^^^^^^^^^^
You can export the quantized model directly by using the ``torch.save`` API, and the quantized model can be loaded by ``torch.load`` without any extra modification. The following example shows the normal procedure of saving and loading a quantized model and getting related parameters in QAT.

.. code-block:: python

   # Init model and quantize it by using NNI QAT
   model = Mnist()
   configure_list = [...]
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
   quantizer = QAT_Quantizer(model, configure_list, optimizer)
   quantizer.compress()

   model.to(device)
   
   # Quantization-aware training
   for epoch in range(40):
       print('# Epoch {} #'.format(epoch))
       train(model, quantizer, device, train_loader, optimizer)
   
   # Save quantized model which is generated by using NNI QAT algorithm
   torch.save(model.state_dict(), "quantized_model.pkt")

   # Simulate model loading procedure
   # Have to init new model and compress it before loading
   qmodel_load = Mnist()
   optimizer = torch.optim.SGD(qmodel_load.parameters(), lr=0.01, momentum=0.5)
   quantizer = QAT_Quantizer(qmodel_load, configure_list, optimizer)
   quantizer.compress()
   
   # Load quantized model
   qmodel_load.load_state_dict(torch.load("quantized_model.pkt"))

   # Get scale, zero_point and weight of conv1 in loaded model
   conv1 = qmodel_load.conv1
   scale = conv1.module.scale
   zero_point = conv1.module.zero_point
   weight = conv1.module.weight

If you want to actually speed up the compressed model, please refer to `NNI model speedup <./ModelSpeedup.rst>`__ for details.