"src/target/codegen_cuda.cc" did not exist on "64f17c2f369e612cc297d358f607307a615bbb59"
Commit a2210436 authored by Tang Lang, committed by QuanluZhang

Knowledge Distillation support (#1819)

parent 665790fc
@@ -6,7 +6,9 @@ NNI provides an easy-to-use toolkit to help user design and use compression algo
## Supported algorithms
-We have provided two naive compression algorithms and three popular ones for users, including two pruning algorithms and three quantization algorithms:
+We have provided several compression algorithms, including pruning algorithms and quantization algorithms:
**Pruning**
|Name|Brief Introduction of Algorithm|
|---|---|
@@ -16,6 +18,11 @@ We have provided two naive compression algorithms and three popular ones for use
| [Slim Pruner](./Pruner.md#slim-pruner) | Pruning channels in convolution layers by pruning scaling factors in BN layers (Learning Efficient Convolutional Networks through Network Slimming) [Reference Paper](https://arxiv.org/abs/1708.06519) |
| [Lottery Ticket Pruner](./Pruner.md#agp-pruner) | The pruning process used by "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It prunes a model iteratively. [Reference Paper](https://arxiv.org/abs/1803.03635)|
| [FPGM Pruner](./Pruner.md#fpgm-pruner) | Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [Reference Paper](https://arxiv.org/pdf/1811.00250.pdf)|
**Quantization**
|Name|Brief Introduction of Algorithm|
|---|---|
| [Naive Quantizer](./Quantizer.md#naive-quantizer) | Quantize weights to default 8 bits |
| [QAT Quantizer](./Quantizer.md#qat-quantizer) | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. [Reference Paper](http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf)|
| [DoReFa Quantizer](./Quantizer.md#dorefa-quantizer) | DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. [Reference Paper](https://arxiv.org/abs/1606.06160)|
@@ -24,25 +31,26 @@ We have provided two naive compression algorithms and three popular ones for use
We use a simple example to show how to modify your trial code in order to apply the compression algorithms. Let's say you want to prune all weights to 80% sparsity with Level Pruner; you can add the following three lines into your code before training your model ([here](https://github.com/microsoft/nni/tree/master/examples/model_compress) is the complete code).
-Tensorflow code
+PyTorch code
```python
-from nni.compression.tensorflow import LevelPruner
+from nni.compression.torch import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
-pruner = LevelPruner(tf.get_default_graph(), config_list)
+pruner = LevelPruner(model, config_list)
pruner.compress()
```
-PyTorch code
+Tensorflow code
```python
-from nni.compression.torch import LevelPruner
+from nni.compression.tensorflow import LevelPruner
config_list = [{ 'sparsity': 0.8, 'op_types': ['default'] }]
-pruner = LevelPruner(model, config_list)
+pruner = LevelPruner(tf.get_default_graph(), config_list)
pruner.compress()
```
-You can use other compression algorithms in the package of `nni.compression`. The algorithms are implemented in both PyTorch and Tensorflow, under `nni.compression.torch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for detail description of supported algorithms.
+You can use other compression algorithms in the package of `nni.compression`. The algorithms are implemented in both PyTorch and Tensorflow, under `nni.compression.torch` and `nni.compression.tensorflow` respectively. You can refer to [Pruner](./Pruner.md) and [Quantizer](./Quantizer.md) for a detailed description of the supported algorithms. Also, if you want to use knowledge distillation, you can refer to [KDExample](../TrialExample/KDExample.md).
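For instance, switching from the Level Pruner above to the L1-norm filter pruner (the pruner used by the knowledge distillation example added in this commit) only changes the import and the config; a PyTorch sketch:

```python
from nni.compression.torch import L1FilterPruner

# prune 80% of the filters of every Conv2d layer, ranked by L1 norm
config_list = [{ 'sparsity': 0.8, 'op_types': ['Conv2d'] }]
pruner = L1FilterPruner(model, config_list)
model = pruner.compress()
```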
The function call `pruner.compress()` modifies the user-defined model (in Tensorflow the model can be obtained with `tf.get_default_graph()`, while in PyTorch the model is the defined model class), inserting masks into it. Then when you run the model, the masks take effect. The masks can be adjusted at runtime by the algorithms.
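Concretely, a minimal PyTorch sketch of what follows `compress()`: the pruner wraps `model` in place, so the usual training and evaluation loop stays unchanged, and the result can be exported afterwards (`train`, `test`, `fine_tune_epochs` and the file names stand for your own functions and choices):

```python
model = pruner.compress()
for epoch in range(fine_tune_epochs):                # number of fine-tuning epochs is up to you
    train(model, device, train_loader, optimizer)    # your existing training loop; masks are applied on every forward pass
    test(model, device, test_loader)                 # your existing evaluation loop
# export the masked weights and the masks themselves
pruner.export_model(model_path='pruned_model.pth', mask_path='mask.pth')
```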
Knowledge Distillation on NNI Compressor
===
## KnowledgeDistill
NNI supports knowledge distillation. As described in [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531), the compressed model is trained to mimic a pre-trained, larger model. This training setting is also referred to as "teacher-student", where the large model is the teacher and the small model is the student.
![](../../img/distill.png)
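The student is trained on a weighted sum of its own task loss and a distillation term that compares the temperature-softened outputs of the two models. With the temperature `kd_T` written as $T$, the objective used in the snippet below is roughly

$$\mathcal{L} = \alpha \cdot \mathrm{CE}(z_s, y) + \beta \cdot \mathrm{KL}\big(\mathrm{softmax}(z_t / T) \,\|\, \mathrm{softmax}(z_s / T)\big)$$

where $z_s$ and $z_t$ are the student's and teacher's logits and $\alpha$, $\beta$ are user-chosen weights.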
### Usage
PyTorch code
```python
from knowledge_distill.knowledge_distill import KnowledgeDistill
kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
alpha = 1
beta = 0.8
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    # you only need to add the following line to fine-tune with knowledge distillation
    loss = alpha * loss + beta * kd.loss(data=data, student_out=output)
    loss.backward()
    optimizer.step()
```
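The snippet above assumes `kd_teacher_model` is an already trained PyTorch model. A minimal sketch of preparing it (the `TeacherNet` class and the checkpoint path are placeholders for your own teacher):

```python
import torch
from knowledge_distill.knowledge_distill import KnowledgeDistill

kd_teacher_model = TeacherNet()                              # placeholder: your pre-trained teacher architecture
kd_teacher_model.load_state_dict(torch.load('teacher.pth'))  # placeholder checkpoint path
kd_teacher_model.to(device)
kd_teacher_model.eval()                                      # the teacher is only queried, never trained
kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
```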
#### User configuration for KnowledgeDistill
* **kd_teacher_model:** The pre-trained teacher model
* **kd_T:** Temperature for smoothing the teacher model's output (see the sketch below)
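To get a feel for the temperature, a quick standalone sketch (not part of the example) comparing the softmax of the same logits at `kd_T=1` and `kd_T=5`:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
print(F.softmax(logits / 1.0, dim=0))  # sharper: roughly [0.66, 0.24, 0.10]
print(F.softmax(logits / 5.0, dim=0))  # softer:  roughly [0.40, 0.33, 0.27]
```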
The complete code can be found here.
import logging
import torch
import torch.nn.functional as F

_logger = logging.getLogger(__name__)


class KnowledgeDistill():
    """
    Knowledge Distillation support while fine-tuning the compressed model

    Geoffrey Hinton, Oriol Vinyals, Jeff Dean
    "Distilling the Knowledge in a Neural Network"
    https://arxiv.org/abs/1503.02531
    """
    def __init__(self, teacher_model, kd_T=1):
        """
        Parameters
        ----------
        teacher_model : pytorch model
            the teacher_model for teaching the student model; it should be pretrained
        kd_T : float
            kd_T is the temperature parameter; when kd_T=1 we get the standard softmax function.
            As kd_T grows, the probability distribution generated by the softmax function becomes softer
        """
        self.teacher_model = teacher_model
        self.kd_T = kd_T

    def _get_kd_loss(self, data, student_out, teacher_out_preprocess=None):
        """
        Parameters
        ----------
        data : torch.Tensor
            the input training data
        student_out : torch.Tensor
            output of the student network
        teacher_out_preprocess : function
            a function for pre-processing the teacher model's output,
            e.g. teacher_out_preprocess=lambda x: x[0]
            extracts the first tensor of a tuple output (tensor1, tensor2) -> tensor1

        Returns
        -------
        torch.Tensor
            distillation loss
        """
        # the teacher only provides targets, so no gradients are needed for its forward pass
        with torch.no_grad():
            kd_out = self.teacher_model(data)
        if teacher_out_preprocess is not None:
            kd_out = teacher_out_preprocess(kd_out)
        assert type(kd_out) is torch.Tensor
        assert type(student_out) is torch.Tensor
        assert kd_out.shape == student_out.shape
        # soften both distributions with the temperature and compute the KL divergence
        soft_log_out = F.log_softmax(student_out / self.kd_T, dim=1)
        soft_t = F.softmax(kd_out / self.kd_T, dim=1)
        loss_kd = F.kl_div(soft_log_out, soft_t.detach(), reduction='batchmean')
        return loss_kd

    def loss(self, data, student_out):
        """
        Parameters
        ----------
        data : torch.Tensor
            Input of the student model
        student_out : torch.Tensor
            Output of the student model

        Returns
        -------
        torch.Tensor
            Distillation loss, to be combined with the student's own loss by the caller
        """
        return self._get_kd_loss(data, student_out)
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from nni.compression.torch import L1FilterPruner
from knowledge_distill.knowledge_distill import KnowledgeDistill
class vgg(nn.Module):
    def __init__(self, init_weights=True):
        super(vgg, self).__init__()
        cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512]
        self.cfg = cfg
        self.feature = self.make_layers(cfg, True)
        num_classes = 10
        self.classifier = nn.Sequential(
            nn.Linear(cfg[-1], 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes)
        )
        if init_weights:
            self._initialize_weights()

    def make_layers(self, cfg, batch_norm=True):
        layers = []
        in_channels = 3
        for v in cfg:
            if v == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1, bias=False)
                if batch_norm:
                    layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
                else:
                    layers += [conv2d, nn.ReLU(inplace=True)]
                in_channels = v
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.feature(x)
        x = nn.AvgPool2d(2)(x)
        x = x.view(x.size(0), -1)
        y = self.classifier(x)
        return y

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(0.5)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()
def train(model, device, train_loader, optimizer, kd=None):
    alpha = 1
    beta = 0.8
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        student_loss = F.cross_entropy(output, target)
        if kd is not None:
            # combine the ordinary task loss with the distillation loss
            kd_loss = kd.loss(data=data, student_out=output)
            loss = alpha * student_loss + beta * kd_loss
        else:
            loss = student_loss
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print('{:2.0f}%  Loss {}'.format(100 * batch_idx / len(train_loader), loss.item()))
def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # the model outputs raw logits, so use cross_entropy for the summed test loss
            test_loss += F.cross_entropy(output, target, reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    acc = 100 * correct / len(test_loader.dataset)
    print('Loss: {}  Accuracy: {}%\n'.format(test_loss, acc))
    return acc
def main():
    torch.manual_seed(0)
    device = torch.device('cuda')
    train_loader = torch.utils.data.DataLoader(
        datasets.CIFAR10('./data.cifar10', train=True, download=True,
                         transform=transforms.Compose([
                             transforms.Pad(4),
                             transforms.RandomCrop(32),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
                         ])),
        batch_size=64, shuffle=True)
    test_loader = torch.utils.data.DataLoader(
        datasets.CIFAR10('./data.cifar10', train=False,
                         transform=transforms.Compose([
                             transforms.ToTensor(),
                             transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
                         ])),
        batch_size=200, shuffle=False)
    model = vgg()
    model.to(device)

    # Train the base VGG-16 model
    print('=' * 10 + 'Train the unpruned base model' + '=' * 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 160, 0)
    for epoch in range(160):
        print('# Epoch {} #'.format(epoch))
        train(model, device, train_loader, optimizer)
        test(model, device, test_loader)
        lr_scheduler.step(epoch)
    torch.save(model.state_dict(), 'vgg16_cifar10.pth')

    # Test base model accuracy
    print('=' * 10 + 'Test on the original model' + '=' * 10)
    model.load_state_dict(torch.load('vgg16_cifar10.pth'))
    test(model, device, test_loader)
    # top1 = 93.51%

    # Pruning configuration: prune 80% of the filters of every convolution layer according to the L1 norm
    configure_list = [{
        'sparsity': 0.8,
        'op_types': ['Conv2d'],
    }]

    # Prune model and test accuracy without fine tuning.
    print('=' * 10 + 'Test on the pruned model before fine tune' + '=' * 10)
    pruner = L1FilterPruner(model, configure_list)
    model = pruner.compress()
    test(model, device, test_loader)
    # top1 = 10.00%

    # Fine tune the pruned model for 40 epochs with knowledge distillation and test accuracy
    print('=' * 10 + 'Fine tuning' + '=' * 10)
    optimizer_finetune = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
    best_top1 = 0
    kd_teacher_model = vgg()
    kd_teacher_model.to(device)
    kd_teacher_model.load_state_dict(torch.load('vgg16_cifar10.pth'))
    kd = KnowledgeDistill(kd_teacher_model, kd_T=5)
    for epoch in range(40):
        pruner.update_epoch(epoch)
        print('# Epoch {} #'.format(epoch))
        train(model, device, train_loader, optimizer_finetune, kd)
        top1 = test(model, device, test_loader)
        if top1 > best_top1:
            best_top1 = top1
            # Export the best model: 'model_path' stores state_dict of the pruned model,
            # 'mask_path' stores mask_dict of the pruned model
            pruner.export_model(model_path='pruned_vgg16_cifar10.pth', mask_path='mask_vgg16_cifar10.pth')

    # Test the exported model
    print('=' * 10 + 'Test on the pruned model after fine tune' + '=' * 10)
    new_model = vgg()
    new_model.to(device)
    new_model.load_state_dict(torch.load('pruned_vgg16_cifar10.pth'))
    test(new_model, device, test_loader)
    # top1 = 85.43% with kd, top1 = 85.04% without kd


if __name__ == '__main__':
    main()