"ts/webui/git@developer.sourcefind.cn:OpenDAS/nni.git" did not exist on "588f299b3fb00290dd66716c8413f476162fe631"
Tutorial.rst 9.61 KB
Newer Older
colorjam's avatar
colorjam committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Tutorial
========

.. contents::

In this tutorial, we explain the usage of model compression in NNI in more detail.

Setup compression goal
----------------------

Specify the configuration
^^^^^^^^^^^^^^^^^^^^^^^^^

Users can specify the configuration (i.e., ``config_list``\ ) for a compression algorithm. For example, when compressing a model, users may want to specify the sparsity ratio, to specify different ratios for different types of operations, to exclude certain types of operations, or to compress only certain types of operations. For users to express these kinds of requirements, we define a configuration specification. It can be seen as a Python ``list`` object, where each element is a ``dict`` object.

The ``dict``\ s in the ``list`` are applied one by one, that is, the configurations in a latter ``dict`` overwrite the configurations in former ones for the operations that are within the scope of both of them.

There are different keys in a ``dict``. Some of them are common keys supported by all the compression algorithms:

* **op_types**\ : This is to specify the types of operations to be compressed. 'default' means following the algorithm's default setting. All supported module types are defined in :githublink:`default_layers.py <nni/compression/pytorch/default_layers.py>` for PyTorch.
* **op_names**\ : This is to specify, by name, the operations to be compressed. If this field is omitted, operations will not be filtered by it.
* **exclude**\ : Default is False. If this field is True, the operations with the specified types and names will be excluded from the compression.

Some other keys are often specific to a certain algorithm, users can refer to `pruning algorithms <./Pruner.rst>`__ and `quantization algorithms <./Quantizer.rst>`__ for the keys allowed by each algorithm.

To prune all ``Conv2d`` layers with a sparsity of 0.6, the configuration can be written as:

.. code-block:: python

   [{
      'sparsity': 0.6,
      'op_types': ['Conv2d']
   }]

To control the sparsity of specific layers, the configuration can be written as:

.. code-block:: python

   [{
      'sparsity': 0.8,
      'op_types': ['default']
   }, 
   {
      'sparsity': 0.6,
      'op_names': ['op_name1', 'op_name2']
   }, 
   {
      'exclude': True,
      'op_names': ['op_name3']
   }]

This configuration means: compress the operations matching the algorithm's default setting with sparsity 0.8, but use sparsity 0.6 for ``op_name1`` and ``op_name2``, and do not compress ``op_name3``.
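
For example, such a ``config_list`` can then be passed to a pruner. The following is a minimal sketch, assuming the ``LevelPruner`` import path of NNI v2.x:

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner

   # `model` is the PyTorch model to be pruned; `config_list` is defined as above
   pruner = LevelPruner(model, config_list)
   model = pruner.compress()  # wraps the target layers and applies the masks during forward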

Quantization specific keys
^^^^^^^^^^^^^^^^^^^^^^^^^^

Besides the keys explained above, if you use quantization algorithms you need to specify more keys in ``config_list``\ , which are explained below.

* **quant_types** : list of string

Type of quantization you want to apply; currently 'weight', 'input', and 'output' are supported. 'weight' means applying the quantization operation to the weight parameter of modules. 'input' means applying the quantization operation to the input of the module's forward method. 'output' means applying the quantization operation to the output of the module's forward method, which is often called 'activation' in some papers.
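
For example, to quantize both the weights and the forward outputs of the selected modules:

.. code-block:: python

   {
      'quant_types': ['weight', 'output'],
   }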


* **quant_bits** : int or dict of {str : int}

Bit length of the quantization. When it is a dict, the key is the quantization type and the value is the bit length, e.g.

.. code-block:: python

   {
      'quant_bits': {
         'weight': 8,
         'output': 4,
      },
   }

When the value is of int type, all quantization types share the same bit length, e.g.

.. code-block:: python

   {
      'quant_bits': 8, # weight and output quantization both use 8 bits
   }

* **quant_dtype** : str or dict of {str : str}

Quantization dtype, used to determine the range of the quantized values. Two choices can be used:

- int: the range is signed
- uint: the range is unsigned

There are two ways to set it. One is a dict where the key is the quantization type and the value is the quantization dtype, e.g.

.. code-block:: python

   {
      'quant_dtype': {
         'weight': 'int',
         'output': 'uint',
      },
   }

The other is a plain str, in which case all quantization types share the same dtype, e.g.

.. code-block:: python

   {
      'quant_dtype': 'int', # the dtype of both weight and output quantization is 'int'
   }

In total, there are two kinds of ``quant_dtype`` you can set: 'int' and 'uint'.

* **quant_scheme** : str or dict of {str : str}

Quantization scheme, used to determine the quantization manner. Four choices can be used:

- per_tensor_affine: per tensor, asymmetric quantization
- per_tensor_symmetric: per tensor, symmetric quantization
- per_channel_affine: per channel, asymmetric quantization
- per_channel_symmetric: per channel, symmetric quantization

There are two ways to set it. One is a dict where the key is the quantization type and the value is the quantization scheme, e.g.

.. code-block:: python

   {
      'quant_scheme': {
         'weight': 'per_channel_symmetric',
         'output': 'per_tensor_affine',
         },
   }

The other is a plain str, in which case all quantization types share the same ``quant_scheme``, e.g.

.. code-block:: python

   {
      'quant_scheme': 'per_channel_symmetric', # the quant_scheme of both weight and output quantization is 'per_channel_symmetric'
   }

In total, there are four kinds of ``quant_scheme`` you can set: 'per_tensor_affine', 'per_tensor_symmetric', 'per_channel_affine' and 'per_channel_symmetric'.

The following example shows a more complete ``config_list``\ : it uses ``op_names`` (or ``op_types``\ ) to specify the target layers along with the quantization bits for those layers.

.. code-block:: python

   config_list = [{
      'quant_types': ['weight'],
      'quant_bits': 8,
      'op_names': ['conv1'],
      'quant_dtype': 'int',
      'quant_scheme': 'per_channel_symmetric'
   },
   {
      'quant_types': ['weight'],
      'quant_bits': 4,
      'quant_start_step': 0,
      'op_names': ['conv2'],
      'quant_dtype': 'int',
      'quant_scheme': 'per_tensor_symmetric'
   },
   {
      'quant_types': ['weight'],
      'quant_bits': 3,
      'op_names': ['fc1'],
      'quant_dtype': 'int',
      'quant_scheme': 'per_tensor_symmetric'
   },
   {
      'quant_types': ['weight'],
      'quant_bits': 2,
      'op_names': ['fc2'],
      'quant_dtype': 'int',
      'quant_scheme': 'per_channel_symmetric'
   }]

In this example, ``op_names`` specifies the layers by name, and the four layers will be quantized with different ``quant_bits``.
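
This ``config_list`` would then be passed to a quantizer; a minimal sketch, assuming the ``QAT_Quantizer`` import path of NNI v2.x:

.. code-block:: python

   from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

   # `model` and `optimizer` are assumed to be defined elsewhere
   quantizer = QAT_Quantizer(model, config_list, optimizer)
   quantizer.compress()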


Export compression result
-------------------------

Export the pruned model
^^^^^^^^^^^^^^^^^^^^^^^

If you are pruning your model, you can easily export the pruned model using the following API. The ``state_dict`` of the sparse model weights will be stored in ``model.pth``\ , which can be loaded by ``torch.load('model.pth')``. Note that the exported ``model.pth`` has the same parameters as the original model, except that the masked weights are zero. ``mask_dict`` stores the binary masks produced by the pruning algorithm, which can be further used to speed up the model.

.. code-block:: python

   # export model weights and mask
   pruner.export_model(model_path='model.pth', mask_path='mask.pth')

   # apply mask to model
   from nni.compression.pytorch import apply_compression_results

   # `device` is the torch.device on which the model lives; 'mask.pth' is the exported mask file
   apply_compression_results(model, 'mask.pth', device)


Export the model in ``onnx`` format (``input_shape`` needs to be specified):

.. code-block:: python

   pruner.export_model(model_path='model.pth', mask_path='mask.pth', onnx_path='model.onnx', input_shape=[1, 1, 28, 28])
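
To sanity-check the exported file you can run it with an ONNX runtime; the sketch below uses the third-party ``onnxruntime`` package (not part of NNI) and the MNIST input shape assumed above:

.. code-block:: python

   import numpy as np
   import onnxruntime as ort

   session = ort.InferenceSession('model.onnx')
   input_name = session.get_inputs()[0].name
   dummy = np.random.randn(1, 1, 28, 28).astype(np.float32)
   outputs = session.run(None, {input_name: dummy})
   print(outputs[0].shape)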


Export the quantized model
^^^^^^^^^^^^^^^^^^^^^^^^^^

You can export the quantized model directly by using the ``torch.save`` API, and the quantized model can be loaded by ``torch.load`` without any extra modification. The following example shows the normal procedure of saving and loading a quantized model and getting related parameters in QAT.

.. code-block:: python
   
   # Save quantized model which is generated by using NNI QAT algorithm
   torch.save(model.state_dict(), "quantized_model.pth")

   # Simulate model loading procedure
   # Have to init new model and compress it before loading
   qmodel_load = Mnist()
   optimizer = torch.optim.SGD(qmodel_load.parameters(), lr=0.01, momentum=0.5)
   quantizer = QAT_Quantizer(qmodel_load, config_list, optimizer)
   quantizer.compress()
   
   # Load quantized model
   qmodel_load.load_state_dict(torch.load("quantized_model.pth"))

   # Get scale, zero_point and weight of conv1 in loaded model
   conv1 = qmodel_load.conv1
   scale = conv1.module.scale
   zero_point = conv1.module.zero_point
   weight = conv1.module.weight


Speed up the model
------------------

Masks alone do not provide a real speedup of your model. The model should be sped up based on the exported masks; thus, we provide an API to speed up your model as shown below. After invoking ``speedup_model`` on your model, your model becomes a smaller one with shorter inference latency.

.. code-block:: python

   from nni.compression.pytorch import apply_compression_results, ModelSpeedup

   # `masks_file` is the mask file exported by the pruner (e.g. 'mask.pth');
   # config['input_shape'] is the shape of a dummy input used for tracing
   dummy_input = torch.randn(config['input_shape']).to(device)
   m_speedup = ModelSpeedup(model, dummy_input, masks_file, device)
   m_speedup.speedup_model()
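
To check the effect, you can compare the average inference latency before and after calling ``speedup_model``; a rough sketch (the numbers depend on your hardware):

.. code-block:: python

   import time

   import torch

   def measure_latency(model, dummy_input, n_runs=100):
       model.eval()
       with torch.no_grad():
           start = time.time()
           for _ in range(n_runs):
               model(dummy_input)
       return (time.time() - start) / n_runs

   print('average latency (s):', measure_latency(model, dummy_input))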


Please refer to `here <ModelSpeedup.rst>`__ for a detailed description. The example code for model speedup can be found :githublink:`here <examples/model_compress/pruning/model_speedup.py>`.


Control the Fine-tuning process
-------------------------------

Enhance the fine-tuning process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Knowledge distillation effectively learns a small student model from a large teacher model. Users can enhance the fine-tuning process by utilizing knowledge distillation to improve the performance of the compressed model. Example code can be found :githublink:`here <examples/model_compress/pruning/finetune_kd_torch.py>`.
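
For illustration, a minimal sketch of a distillation loss that could be combined with the regular training loss during fine-tuning (the names ``distillation_loss``, ``T`` and ``alpha`` are illustrative, not part of the NNI API):

.. code-block:: python

   import torch.nn.functional as F

   def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
       # soft targets from the teacher, softened by temperature T
       soft = F.kl_div(
           F.log_softmax(student_logits / T, dim=1),
           F.softmax(teacher_logits / T, dim=1),
           reduction='batchmean'
       ) * (T * T)
       # standard cross-entropy on the ground-truth labels
       hard = F.cross_entropy(student_logits, labels)
       return alpha * soft + (1 - alpha) * hard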