Supported Pruning Algorithms on NNI
===================================

We provide several pruning algorithms that support fine-grained weight pruning and structural filter pruning. **Fine-grained Pruning** generally results in unstructured models, which need specialized hardware or software to speed up the sparse network. **Filter Pruning** achieves acceleration by removing entire filters. Some pruning algorithms use a one-shot method that prunes weights all at once based on an importance metric (the model then needs to be fine-tuned to compensate for the loss of accuracy). Other pruning algorithms prune weights **iteratively** during optimization and control the pruning schedule; these include some automatic pruning algorithms.


**One-shot Pruning**

* `Level Pruner <#level-pruner>`__ (fine-grained pruning)
* `Slim Pruner <#slim-pruner>`__
* `FPGM Pruner <#fpgm-pruner>`__
* `L1Filter Pruner <#l1filter-pruner>`__
* `L2Filter Pruner <#l2filter-pruner>`__
* `Activation APoZ Rank Filter Pruner <#activationapozrankfilter-pruner>`__
* `Activation Mean Rank Filter Pruner <#activationmeanrankfilter-pruner>`__
* `Taylor FO On Weight Pruner <#taylorfoweightfilter-pruner>`__

**Iterative Pruning**

* `AGP Pruner <#agp-pruner>`__
* `NetAdapt Pruner <#netadapt-pruner>`__
* `SimulatedAnnealing Pruner <#simulatedannealing-pruner>`__
* `AutoCompress Pruner <#autocompress-pruner>`__
* `AMC Pruner <#amc-pruner>`__
* `Sensitivity Pruner <#sensitivity-pruner>`__
* `ADMM Pruner <#admm-pruner>`__

**Others**

* `Lottery Ticket Hypothesis <#lottery-ticket-hypothesis>`__
* `Transformer Head Pruner <#transformer-head-pruner>`__

Level Pruner
------------

This is a basic one-shot pruner: you set a target sparsity level (expressed as a fraction; 0.6 means 60% of the weight parameters will be pruned).

We first sort the weights in the specified layer by their absolute values, and then mask to zero the smallest-magnitude weights until the desired sparsity level is reached.
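
Below is a minimal sketch (independent of NNI's implementation) of the magnitude-threshold masking idea described above; the helper ``level_mask`` is hypothetical.

.. code-block:: python

   import torch

   def level_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
       """Return a 0/1 mask that zeroes out the smallest-magnitude weights."""
       k = int(weight.numel() * sparsity)
       if k == 0:
           return torch.ones_like(weight)
       # the k-th smallest absolute value serves as the pruning threshold
       threshold = weight.abs().view(-1).kthvalue(k).values
       return (weight.abs() > threshold).float()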

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LevelPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
   pruner = LevelPruner(model, config_list)
   pruner.compress()

User configuration for Level Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.LevelPruner

**TensorFlow**

..  autoclass:: nni.algorithms.compression.tensorflow.pruning.LevelPruner


Slim Pruner
-----------

This is a one-shot pruner that adds sparsity regularization on the scaling factors of batch normalization (BN) layers during training to identify unimportant channels. The channels with small scaling factor values will be pruned. For more details, please refer to `Learning Efficient Convolutional Networks through Network Slimming <https://arxiv.org/pdf/1708.06519.pdf>`__\.
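
The core idea is an extra L1 penalty on the BN scaling factors added to the training loss. A minimal sketch (not the NNI implementation; ``sparsity_lambda`` is a hypothetical hyperparameter):

.. code-block:: python

   import torch.nn as nn

   def bn_scaling_l1_penalty(model: nn.Module, sparsity_lambda: float = 1e-4):
       """Sum of |gamma| over all BatchNorm2d layers, scaled by sparsity_lambda."""
       penalty = 0.0
       for m in model.modules():
           if isinstance(m, nn.BatchNorm2d):
               penalty = penalty + m.weight.abs().sum()
       return sparsity_lambda * penalty

   # during training: loss = criterion(output, target) + bn_scaling_l1_penalty(model)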

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import SlimPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['BatchNorm2d'] }]
   pruner = SlimPruner(model, config_list, optimizer, trainer, criterion)
   pruner.compress()

User configuration for Slim Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.SlimPruner

Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^

We implemented one of the experiments in `Learning Efficient Convolutional Networks through Network Slimming <https://arxiv.org/pdf/1708.06519.pdf>`__\ : we pruned ``70%`` of the channels in the **VGGNet** for CIFAR-10 from the paper, in which ``88.5%`` of the parameters are pruned. Our experiment results are as follows:

.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Error(paper/ours)
     - Parameters
     - Pruned
   * - VGGNet
     - 6.34/6.69
     - 20.04M
     -
   * - Pruned-VGGNet
     - 6.20/6.34
     - 2.03M
     - 88.5%


The experiments code can be found at :githublink:`examples/model_compress/pruning/basic_pruners_torch.py <examples/model_compress/pruning/basic_pruners_torch.py>`

.. code-block:: bash

   python basic_pruners_torch.py --pruner slim --model vgg19 --sparsity 0.7 --speed-up


----

FPGM Pruner
-----------

This is a one-shot pruner that prunes filters with the smallest geometric median. FPGM chooses the filters with the most replaceable contribution.
For more details, please refer to `Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration <https://arxiv.org/pdf/1811.00250.pdf>`__.
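
As a rough illustration (not NNI's code) of the criterion: a filter close to the geometric median of all filters in a layer has a small total distance to the others, so it is considered the most replaceable.

.. code-block:: python

   import torch

   def fpgm_scores(conv_weight: torch.Tensor) -> torch.Tensor:
       """conv_weight: (out_channels, in_channels, k, k); lower score = more replaceable."""
       flat = conv_weight.view(conv_weight.size(0), -1)
       dist = torch.cdist(flat, flat)   # pairwise Euclidean distances between filters
       return dist.sum(dim=1)           # filters with the smallest sums are pruned first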

We also provide a dependency-aware mode for this pruner to get a better speedup from pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import FPGMPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = FPGMPruner(model, config_list)
   pruner.compress()

User configuration for FPGM Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.FPGMPruner

L1Filter Pruner
---------------

This is a one-shot pruner that prunes filters in the **convolution layers**.

..
   The procedure of pruning m filters from the ith convolutional layer is as follows:

   #. For each filter :math:`F_{i,j}`, calculate the sum of its absolute kernel weights :math:`s_j=\sum_{l=1}^{n_i}\sum|K_l|`.

   #. Sort the filters by :math:`s_j`.

   #. Prune :math:`m` filters with the smallest sum values and their corresponding feature maps. The
      kernels in the next convolutional layer corresponding to the pruned feature maps are also removed.

   #. A new kernel matrix is created for both the :math:`i`-th and :math:`i+1`-th layers, and the remaining kernel
      weights are copied to the new model.

For more details, please refer to `PRUNING FILTERS FOR EFFICIENT CONVNETS <https://arxiv.org/abs/1608.08710>`__\.
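
A minimal sketch (not NNI's implementation) of the per-filter score used for ranking, i.e. the sum of the absolute kernel weights of each filter:

.. code-block:: python

   import torch

   def l1_filter_scores(conv_weight: torch.Tensor) -> torch.Tensor:
       """conv_weight: (out_channels, in_channels, k, k); one score per output filter."""
       return conv_weight.abs().sum(dim=(1, 2, 3))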



In addition, we also provide a dependency-aware mode for the L1FilterPruner. For more details about the dependency-aware mode, please refer to `dependency-aware mode <./DependencyAware.rst>`__.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import L1FilterPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
   pruner = L1FilterPruner(model, config_list)
   pruner.compress()

User configuration for L1Filter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.L1FilterPruner

Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^

We implemented one of the experiments in `PRUNING FILTERS FOR EFFICIENT CONVNETS <https://arxiv.org/abs/1608.08710>`__ with **L1FilterPruner**\ : we pruned **VGG-16** for CIFAR-10 to **VGG-16-pruned-A** from the paper, in which ``64%`` of the parameters are pruned. Our experiment results are as follows:

.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Error(paper/ours)
     - Parameters
     - Pruned
   * - VGG-16
     - 6.75/6.49
     - 1.5x10^7
     - 
   * - VGG-16-pruned-A
     - 6.60/6.47
     - 5.4x10^6
     - 64.0%


The experiments code can be found at :githublink:`examples/model_compress/pruning/basic_pruners_torch.py <examples/model_compress/pruning/basic_pruners_torch.py>`

.. code-block:: bash

   python basic_pruners_torch.py --pruner l1filter --model vgg16 --speed-up

----

L2Filter Pruner
---------------

This is a structured pruning algorithm that prunes the filters with the smallest L2 norm of the weights. It is implemented as a one-shot pruner.
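
Analogous to the L1 case above, a sketch (not NNI's code) of the per-filter L2 score:

.. code-block:: python

   import torch

   def l2_filter_scores(conv_weight: torch.Tensor) -> torch.Tensor:
       """conv_weight: (out_channels, in_channels, k, k); one score per output filter."""
       return conv_weight.pow(2).sum(dim=(1, 2, 3)).sqrt()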

We also provide a dependency-aware mode for this pruner to get a better speedup from pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import L2FilterPruner
   config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
   pruner = L2FilterPruner(model, config_list)
   pruner.compress()

User configuration for L2Filter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.L2FilterPruner

----

ActivationAPoZRankFilter Pruner
-------------------------------

ActivationAPoZRankFilter Pruner prunes the filters with the smallest importance scores, measured by the ``APoZ`` (Average Percentage of Zeros) criterion calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``APoZ`` is explained in the paper `Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures <https://arxiv.org/abs/1607.03250>`__.

The APoZ is defined as:

:math:`APoZ_{c}^{(i)} = APoZ\left(O_{c}^{(i)}\right)=\frac{\sum_{k}^{N} \sum_{j}^{M} f\left(O_{c, j}^{(i)}(k)=0\right)}{N \times M}`
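
A rough sketch (not NNI's code) of computing APoZ per channel from a batch of post-activation outputs of shape ``(N, C, H, W)``:

.. code-block:: python

   import torch

   def apoz_per_channel(activations: torch.Tensor) -> torch.Tensor:
       """Fraction of zero (e.g. post-ReLU) activations for each of the C channels."""
       zeros = (activations == 0).float()
       return zeros.mean(dim=(0, 2, 3))   # channels with the largest APoZ are pruned first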


We also provide a dependency-aware mode for this pruner to get a better speedup from pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import ActivationAPoZRankFilterPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = ActivationAPoZRankFilterPruner(model, config_list, optimizer, trainer, criterion, sparsifying_training_batches=1)
   pruner.compress()

Note: ActivationAPoZRankFilterPruner is used to prune convolutional layers within deep neural networks, so the ``op_types`` field supports only convolutional layers.

You can view :githublink:`example <examples/model_compress/pruning/basic_pruners_torch.py>` for more information.

User configuration for ActivationAPoZRankFilter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.ActivationAPoZRankFilterPruner

----

ActivationMeanRankFilter Pruner
-------------------------------

ActivationMeanRankFilterPruner prunes the filters with the smallest importance scores, measured by the ``mean activation`` criterion calculated from the output activations of convolution layers, to achieve a preset level of network sparsity. The pruning criterion ``mean activation`` is explained in section 2.2 of the paper `Pruning Convolutional Neural Networks for Resource Efficient Inference <https://arxiv.org/abs/1611.06440>`__. Other pruning criteria mentioned in this paper will be supported in a future release.
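
A rough sketch (not NNI's code) of the ``mean activation`` score per output channel, given a batch of activations of shape ``(N, C, H, W)``:

.. code-block:: python

   import torch

   def mean_activation_per_channel(activations: torch.Tensor) -> torch.Tensor:
       """Channels with the smallest mean activation are pruned first."""
       return activations.mean(dim=(0, 2, 3))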

We also provide a dependency-aware mode for this pruner to get a better speedup from pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import ActivationMeanRankFilterPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = ActivationMeanRankFilterPruner(model, config_list, optimizer, trainer, criterion, sparsifying_training_batches=1)
   pruner.compress()

Note: ActivationMeanRankFilterPruner is used to prune convolutional layers within deep neural networks, so the ``op_types`` field supports only convolutional layers.

You can view :githublink:`example <examples/model_compress/pruning/basic_pruners_torch.py>` for more information.

User configuration for ActivationMeanRankFilterPruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.ActivationMeanRankFilterPruner

----

TaylorFOWeightFilter Pruner
---------------------------

TaylorFOWeightFilter Pruner prunes convolutional layers based on the estimated importance calculated from the first-order Taylor expansion on the weights, to achieve a preset level of network sparsity. The estimated importance of filters is defined in the paper `Importance Estimation for Neural Network Pruning <http://jankautz.com/publications/Importance4NNPruning_CVPR19.pdf>`__. Other pruning criteria mentioned in this paper will be supported in a future release.

..

:math:`\widehat{\mathcal{I}}_{\mathcal{S}}^{(1)}(\mathbf{W}) \triangleq \sum_{s \in \mathcal{S}} \mathcal{I}_{s}^{(1)}(\mathbf{W})=\sum_{s \in \mathcal{S}}\left(g_{s} w_{s}\right)^{2}`
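
A rough sketch (not NNI's code) of the per-filter importance implied by the formula above, i.e. the sum over a filter's weights of (gradient * weight) squared:

.. code-block:: python

   import torch

   def taylorfo_scores(conv_weight: torch.Tensor) -> torch.Tensor:
       """Call after loss.backward() so that conv_weight.grad is populated."""
       contribution = (conv_weight.grad * conv_weight) ** 2
       return contribution.sum(dim=(1, 2, 3))   # one importance score per output filter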


We also provide a dependency-aware mode for this pruner to get a better speedup from pruning. Please refer to `dependency-aware <./DependencyAware.rst>`__ for more details.

In addition, we provide a global-sort mode for this pruner that is aligned with the paper's implementation. Set the parameter ``global_sort`` to ``True`` when instantiating TaylorFOWeightFilterPruner.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import TaylorFOWeightFilterPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = TaylorFOWeightFilterPruner(model, config_list, optimizer, trainer, criterion, sparsifying_training_batches=1)
   pruner.compress()

User configuration for TaylorFOWeightFilter Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.TaylorFOWeightFilterPruner

----

AGP Pruner
----------

This is an iterative pruner, in which the sparsity is increased from an initial sparsity value :math:`s_i` (usually 0) to a final sparsity value :math:`s_f` over a span of :math:`n` pruning steps, starting at training step :math:`t_{0}` and with pruning frequency :math:`\Delta t`:

:math:`s_{t}=s_{f}+\left(s_{i}-s_{f}\right)\left(1-\frac{t-t_{0}}{n \Delta t}\right)^{3} \text { for } t \in\left\{t_{0}, t_{0}+\Delta t, \ldots, t_{0} + n \Delta t\right\}`

For more details please refer to `To prune, or not to prune: exploring the efficacy of pruning for model compression <https://arxiv.org/abs/1710.01878>`__\.
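
A quick numeric check of the schedule above (hypothetical values: initial sparsity 0, final sparsity 0.8, 10 pruning steps):

.. code-block:: python

   def agp_sparsity(t, s_i=0.0, s_f=0.8, t0=0, n=10, delta_t=1):
       """Target sparsity at pruning step t according to the AGP schedule."""
       return s_f + (s_i - s_f) * (1 - (t - t0) / (n * delta_t)) ** 3

   # agp_sparsity(0) -> 0.0, agp_sparsity(5) -> 0.7, agp_sparsity(10) -> 0.8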


Usage
^^^^^

You can prune all weights from 0% to 80% sparsity in 10 epochs with the code below.

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import AGPPruner
   config_list = [{
       'sparsity': 0.8,
       'op_types': ['default']
   }]

   # load a pretrained model or train a model before using a pruner
   # model = MyModel()
   # model.load_state_dict(torch.load('mycheckpoint.pth'))

   # AGP pruner prunes model while fine tuning the model by adding a hook on
   # optimizer.step(), so an optimizer is required to prune the model.
   optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)

   pruner = AGPPruner(model, config_list, optimizer, trainer, criterion, pruning_algorithm='level')
   pruner.compress()

AGP pruner uses the ``LevelPruner`` algorithm to prune the weights by default; however, you can set the ``pruning_algorithm`` parameter to one of the following values to use a different pruning algorithm:


* ``level``\ : LevelPruner
* ``slim``\ : SlimPruner
* ``l1``\ : L1FilterPruner
* ``l2``\ : L2FilterPruner
* ``fpgm``\ : FPGMPruner
* ``taylorfo``\ : TaylorFOWeightFilterPruner
* ``apoz``\ : ActivationAPoZRankFilterPruner
* ``mean_activation``\ : ActivationMeanRankFilterPruner


User configuration for AGP Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.AGPPruner

----

NetAdapt Pruner
---------------

NetAdapt allows a user to automatically simplify a pretrained network to meet the resource budget. 
Given the overall sparsity, NetAdapt will automatically generate the sparsity distribution among different layers by iterative pruning.

For more details, please refer to `NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications <https://arxiv.org/abs/1804.03230>`__.


Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import NetAdaptPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = NetAdaptPruner(model, config_list, short_term_fine_tuner=short_term_fine_tuner, evaluator=evaluator, base_algo='l1', experiment_data_dir='./')
   pruner.compress()

You can view :githublink:`example <examples/model_compress/pruning/auto_pruners_torch.py>` for more information.

User configuration for NetAdapt Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.NetAdaptPruner

SimulatedAnnealing Pruner
-------------------------

We implement a guided heuristic search method, the Simulated Annealing (SA) algorithm, with an enhancement on guided search based on prior experience.
The enhanced SA technique is based on the observation that a DNN layer with a larger number of weights can often tolerate a higher degree of compression with less impact on overall accuracy.


* Randomly initialize a pruning rate distribution (sparsities).
* While current_temperature > stop_temperature:

  #. Generate a perturbation of the current distribution.
  #. Perform a fast evaluation on the perturbed distribution.
  #. Accept the perturbation according to the performance and the acceptance probability; if not accepted, return to step 1.
  #. Cool down: current_temperature <- current_temperature * cool_down_rate.
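
A minimal sketch (independent of NNI's implementation) of this annealing loop; ``evaluate`` and ``perturb`` are hypothetical callables supplied by the caller, and a higher evaluation score is assumed to be better:

.. code-block:: python

   import math
   import random

   def simulated_annealing(init_sparsities, evaluate, perturb,
                           start_temperature=100.0, stop_temperature=20.0, cool_down_rate=0.9):
       current, best = init_sparsities, init_sparsities
       current_score = best_score = evaluate(current)
       t = start_temperature
       while t > stop_temperature:
           candidate = perturb(current)
           score = evaluate(candidate)
           # always accept better candidates; accept worse ones with a temperature-dependent probability
           if score > current_score or random.random() < math.exp((score - current_score) / t):
               current, current_score = candidate, score
               if score > best_score:
                   best, best_score = candidate, score
           t *= cool_down_rate
       return best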

For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import SimulatedAnnealingPruner
   config_list = [{
       'sparsity': 0.5,
       'op_types': ['Conv2d']
   }]
   pruner = SimulatedAnnealingPruner(model, config_list, evaluator=evaluator, base_algo='l1', cool_down_rate=0.9, experiment_data_dir='./')
   pruner.compress()

You can view :githublink:`example <examples/model_compress/pruning/auto_pruners_torch.py>` for more information.

User configuration for SimulatedAnnealing Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.SimulatedAnnealingPruner

AutoCompress Pruner
-------------------

In each round, AutoCompressPruner prunes the model with the same target sparsity to achieve the overall sparsity:

.. code-block:: bash

       1. Generate sparsities distribution using SimulatedAnnealingPruner
       2. Perform ADMM-based structured pruning to generate pruning result for the next round.
          Here we use `speedup` to perform real pruning.


For more details, please refer to `AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates <https://arxiv.org/abs/1907.03141>`__.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import AutoCompressPruner
   config_list = [{
           'sparsity': 0.5,
           'op_types': ['Conv2d']
       }]
   pruner = AutoCompressPruner(
               model, config_list, trainer=trainer, evaluator=evaluator,
               dummy_input=dummy_input, num_iterations=3, optimize_mode='maximize', base_algo='l1',
               cool_down_rate=0.9, admm_num_iterations=30, admm_training_epochs=5, experiment_data_dir='./')
   pruner.compress()

You can view :githublink:`example <examples/model_compress/pruning/auto_pruners_torch.py>` for more information.

User configuration for AutoCompress Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.AutoCompressPruner

AMC Pruner
----------

AMC pruner leverages reinforcement learning to provide the model compression policy.
This learning-based compression policy outperforms conventional rule-based compression policies by achieving a higher compression ratio,
better preserving accuracy, and reducing human labor.


For more details, please refer to `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import AMCPruner
   config_list = [{
           'op_types': ['Conv2d', 'Linear']
       }]
   pruner = AMCPruner(model, config_list, evaluator, val_loader, flops_ratio=0.5)
   pruner.compress()

You can view :githublink:`example <examples/model_compress/pruning/amc/>` for more information.

User configuration for AMC Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.AMCPruner

Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^

We implemented one of the experiments in `AMC: AutoML for Model Compression and Acceleration on Mobile Devices <https://arxiv.org/pdf/1802.03494.pdf>`__\ : we pruned **MobileNet** to 50% FLOPs on ImageNet as in the paper. Our experiment results are as follows:

.. list-table::
   :header-rows: 1
   :widths: auto

   * - Model
     - Top 1 acc.(paper/ours)
     - Top 5 acc. (paper/ours)
     - FLOPS
   * - MobileNet
     - 70.5% / 69.9%
     - 89.3% / 89.1%
     - 50%


The experiments code can be found at :githublink:`examples/model_compress/pruning/ <examples/model_compress/pruning/amc/>`

ADMM Pruner
-----------

Alternating Direction Method of Multipliers (ADMM) is a mathematical optimization technique
that decomposes the original nonconvex problem into two subproblems that can be solved iteratively. In the weight pruning problem, these two subproblems are solved via 1) a gradient descent algorithm and 2) Euclidean projection, respectively.

During the process of solving these two subproblems, the weights of the original model will be changed. A one-shot pruner will then be applied to prune the model according to the given config list.

This solution framework applies both to non-structured and different variations of structured pruning schemes.

For more details, please refer to `A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers <https://arxiv.org/abs/1804.03294>`__.
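
A minimal sketch (not NNI's code) of the Euclidean projection subproblem: projecting a weight tensor onto the constraint set "at most k nonzero entries" simply keeps the k largest-magnitude weights.

.. code-block:: python

   import torch

   def project_to_sparsity(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
       """Keep the (1 - sparsity) fraction of largest-magnitude entries, zero the rest."""
       k = max(1, int(weight.numel() * (1 - sparsity)))
       flat = weight.abs().view(-1)
       mask = torch.zeros_like(flat)
       mask[flat.topk(k).indices] = 1
       return weight * mask.view_as(weight)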

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import ADMMPruner
   config_list = [{
               'sparsity': 0.8,
               'op_types': ['Conv2d'],
               'op_names': ['conv1']
           }, {
               'sparsity': 0.92,
               'op_types': ['Conv2d'],
               'op_names': ['conv2']
           }]
   pruner = ADMMPruner(model, config_list, trainer, num_iterations=30, epochs_per_iteration=5)
   pruner.compress()

You can view :githublink:`example <examples/model_compress/pruning/auto_pruners_torch.py>` for more information.

User configuration for ADMM Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.ADMMPruner

Lottery Ticket Hypothesis
-------------------------

`The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks <https://arxiv.org/abs/1803.03635>`__\ , by Jonathan Frankle and Michael Carbin, provides comprehensive measurement and analysis and articulates the *lottery ticket hypothesis*\ : dense, randomly-initialized, feed-forward networks contain subnetworks (*winning tickets*\ ) that -- when trained in isolation -- reach test accuracy comparable to the original network in a similar number of iterations.

In this paper, the authors use the following process to prune a model, called *iterative pruning*\ :

..

   #. Randomly initialize a neural network f(x; theta_0) (where theta_0 is drawn from D_theta).
   #. Train the network for j iterations, arriving at parameters theta_j.
   #. Prune p% of the parameters in theta_j, creating a mask m.
   #. Reset the remaining parameters to their values in theta_0, creating the winning ticket f(x;m*theta_0).
   #. Repeat step 2, 3, and 4.


If the configured final sparsity is P (e.g., 0.8) and there are n pruning iterations, each iteration prunes 1-(1-P)^(1/n) of the weights that survived the previous round.
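
A quick sanity check of the per-round pruning rate (hypothetical numbers):

.. code-block:: python

   P, n = 0.8, 5
   per_round = 1 - (1 - P) ** (1 / n)   # ~0.275 of the surviving weights pruned each round
   remaining = (1 - per_round) ** n     # == 1 - P == 0.2 of the original weights remain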

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import LotteryTicketPruner
   config_list = [{
       'prune_iterations': 5,
       'sparsity': 0.8,
       'op_types': ['default']
   }]
   pruner = LotteryTicketPruner(model, config_list, optimizer)
   pruner.compress()
   for _ in pruner.get_prune_iterations():
       pruner.prune_iteration_start()
       for epoch in range(epoch_num):
           ...

The above configuration means that there are 5 pruning iterations. As the 5 iterations are executed in the same run, LotteryTicketPruner needs ``model`` and ``optimizer`` (\ **note that ``lr_scheduler`` should also be added if used**\ ) to reset their states every time a new prune iteration starts. Please use ``get_prune_iterations`` to get the pruning iterations, and invoke ``prune_iteration_start`` at the beginning of each iteration. ``epoch_num`` should be large enough for model convergence, because the hypothesis is that the performance (accuracy) obtained in later rounds with high sparsity is comparable with that obtained in the first round.


User configuration for LotteryTicket Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.LotteryTicketPruner

Reproduced Experiment
^^^^^^^^^^^^^^^^^^^^^

We try to reproduce the experiment result of the fully connected network on MNIST using the same configuration as in the paper. The code can be found :githublink:`here <examples/model_compress/pruning/lottery_torch_mnist_fc.py>`. In this experiment, we prune 10 times; for each pruning iteration, we train the pruned model for 50 epochs.


.. image:: ../../img/lottery_ticket_mnist_fc.png
   :target: ../../img/lottery_ticket_mnist_fc.png
   :alt: 


The above figure shows the result of the fully connected network. ``round0-sparsity-0.0`` is the performance without pruning. Consistent with the paper, pruning around 80% also obtains performance similar to no pruning, and converges a little faster. If too much is pruned, e.g., more than 94%, the accuracy becomes lower and convergence becomes a little slower. One small difference from the paper is that the trend in the paper's data is clearer than in ours.

Sensitivity Pruner
------------------

In each round, SensitivityPruner prunes the model based on each layer's accuracy sensitivity until the final configured sparsity of the whole model is met:

.. code-block:: bash

       1. Analyze the sensitivity of each layer in the current state of the model.
       2. Prune each layer according to the sensitivity.


For more details, please refer to `Learning both Weights and Connections for Efficient Neural Networks  <https://arxiv.org/abs/1506.02626>`__.

Usage
^^^^^

PyTorch code

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import SensitivityPruner
   config_list = [{
           'sparsity': 0.5,
           'op_types': ['Conv2d']
       }]
   pruner = SensitivityPruner(model, config_list, finetuner=fine_tuner, evaluator=evaluator)
   # eval_args and finetune_args are the parameters passed to the evaluator and finetuner respectively
   pruner.compress(eval_args=[model], finetune_args=[model])

User configuration for Sensitivity Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.SensitivityPruner

Transformer Head Pruner
-----------------------

Transformer Head Pruner is a tool designed for pruning attention heads from the models belonging to the `Transformer family <https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf>`__. The following image from `Efficient Transformers: A Survey <https://arxiv.org/pdf/2009.06732.pdf>`__ gives a good overview of the general structure of the Transformer.

.. image:: ../../img/transformer_structure.png
   :target: ../../img/transformer_structure.png
   :alt: 

Typically, each attention layer in Transformer models consists of four weights: three projection matrices for query, key, and value, and an output projection matrix. The outputs of the former three matrices contain the projected results for all heads. Normally, the results are then reshaped so that each head performs the attention computation independently. The final results are concatenated back before being fed into the output projection. Therefore, when an attention head is pruned, the weights corresponding to that head in the three projection matrices are pruned, as are the weights in the output projection corresponding to the head's output. In our implementation, we calculate and apply masks to the four matrices together.
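
A rough illustration (not NNI's implementation) of which entries are masked when one head is pruned, assuming separate ``Linear`` projections with BERT-base sizes (hidden size 768, per-head size 64); the helper ``head_masks`` is hypothetical:

.. code-block:: python

   import torch

   def head_masks(head_idx, head_hidden_dim=64, hidden_size=768):
       """Return 0/1 masks for the Q/K/V weights and the output projection when pruning one head."""
       rows = slice(head_idx * head_hidden_dim, (head_idx + 1) * head_hidden_dim)
       qkv_mask = torch.ones(hidden_size, hidden_size)
       qkv_mask[rows, :] = 0    # nn.Linear weights are (out_features, in_features)
       out_mask = torch.ones(hidden_size, hidden_size)
       out_mask[:, rows] = 0    # the head's output feeds these input columns of the output projection
       return qkv_mask, out_mask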

Note: currently, the pruner can only handle models with projection weights written as separate ``Linear`` modules, i.e., it expects four ``Linear`` modules corresponding to the query, key, value, and output projections. Therefore, in the ``config_list``, you should either write ``['Linear']`` for the ``op_types`` field, or write names corresponding to ``Linear`` modules for the ``op_names`` field. For instance, the `Huggingface transformers <https://huggingface.co/transformers/index.html>`_ are supported, but ``torch.nn.Transformer`` is not.

The pruner implements the following algorithm:

.. code-block:: bash

    Repeat for each pruning iteration (1 for one-shot pruning):
       1. Calculate importance scores for each head in each specified layer using a specific criterion.
       2. Sort heads locally or globally, and prune out some heads with lowest scores. The number of pruned heads is determined according to the sparsity specified in the config.
       3. If the specified pruning iteration is larger than 1 (iterative pruning), finetune the model for a while before the next pruning iteration.

Currently, the following head sorting criteria are supported:

    * "l1_weight": rank heads by the L1-norm of weights of the query, key, and value projection matrices.
    * "l2_weight": rank heads by the L2-norm of weights of the query, key, and value projection matrices.
    * "l1_activation": rank heads by the L1-norm of their attention computation output.
    * "l2_activation": rank heads by the L2-norm of their attention computation output.
    * "taylorfo": rank heads by l1 norm of the output of attention computation * gradient for this output. Check more details in `this paper <https://arxiv.org/abs/1905.10650>`__ and `this one <https://arxiv.org/abs/1611.06440>`__.

We support local sorting (i.e., sorting heads within a layer) and global sorting (sorting all heads together), and you can control this by setting the ``global_sort`` parameter. Note that if ``global_sort=True`` is passed, all weights must have the same sparsity in the config list. However, this does not mean that each layer will be pruned to the same sparsity as specified. This sparsity value will be interpreted as a global sparsity, and each layer is likely to have a different sparsity after pruning by global sort. As a reminder, we found that if global sorting is used, it is usually helpful to use an iterative pruning scheme, interleaving pruning with intermediate finetuning, since global sorting often results in non-uniform sparsity distributions, which make the model more susceptible to forgetting.

In our implementation, we support two ways to group the four weights in the same layer together. You can either pass a nested list containing the names of these modules as the pruner's initialization parameters (usage below), or simply pass a dummy input instead and the pruner will run ``torch.jit.trace`` to group the weights (experimental feature). However, if you would like to assign different sparsity to each layer, you can only use the first option, i.e., passing names of the weights to the pruner (see usage below). Also, note that we require the weights belonging to the same layer to have the same sparsity.

Usage
^^^^^

Suppose we want to prune a BERT with the Huggingface implementation, which has the following architecture (obtained by calling ``print(model)``). Note that we only show the first of the repeated layers in the encoder's ``ModuleList``.

.. image:: ../../img/huggingface_bert_architecture.png
   :target: ../../img/huggingface_bert_architecture.png
   :alt: 

**Usage Example: one-shot pruning, assigning sparsity 0.5 to the first six layers and sparsity 0.25 to the last six layers (PyTorch code)**. Note that

* Here we specify ``op_names`` in the config list to assign different sparsity to different layers.
* Meanwhile, we pass ``attention_name_groups`` to the pruner so that the pruner may group together the weights belonging to the same attention layer.
* Since in this example we want to do one-shot pruning, the ``num_iterations`` parameter is set to 1, and the parameter ``epochs_per_iteration`` is ignored. If you would like to do iterative pruning instead, you can set the ``num_iterations`` parameter to the number of pruning iterations, and the ``epochs_per_iteration`` parameter to the number of finetuning epochs between two iterations.
* The arguments ``trainer`` and ``optimizer`` are only used when we want to do iterative pruning, or the ranking criterion is ``taylorfo``. Here these two parameters are ignored by the pruner.
* The argument ``forward_runner`` is only used when the ranking criterion is ``l1_activation`` or ``l2_activation``. Here this parameter is ignored by the pruner.

.. code-block:: python

   from nni.algorithms.compression.pytorch.pruning import TransformerHeadPruner
   attention_name_groups = list(zip(["encoder.layer.{}.attention.self.query".format(i) for i in range(12)],
                                    ["encoder.layer.{}.attention.self.key".format(i) for i in range(12)],
                                    ["encoder.layer.{}.attention.self.value".format(i) for i in range(12)],
                                    ["encoder.layer.{}.attention.output.dense".format(i) for i in range(12)]))

   kwargs = {"ranking_criterion": "l1_weight",
             "global_sort": False,
             "num_iterations": 1,
             "epochs_per_iteration": 1,    # this is ignored when num_iterations = 1
             "head_hidden_dim": 64,
             "attention_name_groups": attention_name_groups,
             "trainer": trainer,
             "optimizer": optimizer,
             "forward_runner": forward_runner
             }

   config_list = [{
       "sparsity": 0.5,
       "op_types": ["Linear"],
       "op_names": [x for layer in attention_name_groups[:6] for x in layer]      # first six layers
   }, {
       "sparsity": 0.25,
       "op_types": ["Linear"],
       "op_names": [x for layer in attention_name_groups[6:] for x in layer]      # last six layers
   }]

   pruner = TransformerHeadPruner(model, config_list, **kwargs)
   pruner.compress()

In addition to this usage guide, we provide a more detailed example of pruning BERT (Huggingface implementation) for transfer learning on the tasks from the `GLUE benchmark <https://gluebenchmark.com/>`_. Please find it on this :githublink:`page <examples/model_compress/pruning/transformers>`. To run the example, first make sure that you have installed the packages ``transformers`` and ``datasets``. Then, you may start by running the following command:

.. code-block:: bash

	./run.sh gpu_id glue_task

By default, the code will download a pretrained BERT language model, and then finetune it for several epochs on the downstream GLUE task. Then, the ``TransformerHeadPruner`` will be used to prune out heads from each layer by a certain criterion (by default, the code lets the pruner use magnitude ranking, and prunes out 50% of the heads in each layer in a one-shot manner). Finally, the pruned model will be finetuned on the downstream task for several epochs. You can check the details of pruning from the logs printed out by the example. You can also experiment with different pruning settings by changing the parameters in ``run.sh``, or by directly changing the ``config_list`` in ``transformer_pruning.py``.
   
User configuration for Transformer Head Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

..  autoclass:: nni.algorithms.compression.pytorch.pruning.TransformerHeadPruner